Cleantech RAG

By Daniel Perruchoud and George Rowlands

IntroductionΒΆ

This notebook delves into the exciting realm of Cleantech using a dataset of nearly 10,000 news articles from Kaggle, all centered around the energy sector. We'll embark on a journey that includes data exploration, text preprocessing, and culminates in the creation of a Retrieval-Augmented Generation Pipeline (RAG). This powerful approach empowers us to construct an LLM (Large Language Model) that can intelligently answer user queries, drawing upon the knowledge from our curated news articles.

Why RAG? A Cost-Effective and Dynamic Solution

Fine-tuning an LLM can be a resource-intensive and inflexible process. RAG offers a compelling alternative: it uses semantic search to pinpoint the sections of our news articles that directly address a user's question. These retrieved sections are then provided to the LLM as context, enabling it to deliver informed and insightful responses.
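The retrieve-then-generate idea can be sketched in a few lines. Everything here is illustrative: the toy three-dimensional vectors stand in for real embeddings, and the assembled prompt stands in for the call to GPT-4o made later in the notebook.

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# toy "embeddings" of three article chunks (real ones come from an embedding model)
chunks = {
    "Solar capacity grew 20% last year.": [0.9, 0.1, 0.0],
    "Geothermal taps heat in the subsurface.": [0.1, 0.9, 0.1],
    "Battery costs keep falling.": [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=1):
    # rank the chunks by similarity to the query vector and keep the top k
    ranked = sorted(chunks, key=lambda c: cosine(chunks[c], query_vec), reverse=True)
    return ranked[:k]

# a query vector "about geothermal" pulls the matching chunk into the prompt
context = retrieve([0.05, 0.95, 0.05])
prompt = f"Answer using this context:\n{context[0]}\nQuestion: How does geothermal work?"
```

The LLM then answers from the retrieved context instead of relying only on its training data.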

(Figure: RAG pipeline)

SetupΒΆ

To run this notebook we recommend downloading the provided GitHub repository and opening this notebook in Google Colab. To ensure a smooth experience, you'll need:

At the start of the notebook a data.zip will be downloaded from a Google Drive and unzipped. This will then provide you with files that contain checkpoints for all of the expensive processing sections such as chunking, generating embeddings and evaluating the pipeline with an LLM as a judge. This saves you money and a lot of time.
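Checkpointing of this kind can be emulated with a small helper (a sketch under our own naming, not the notebook's actual loading code): reuse a saved result when the file exists, otherwise compute and cache it.

```python
import json
from pathlib import Path

def load_or_compute(path: str, compute):
    # hypothetical checkpoint helper: reuse a cached JSON result if present,
    # otherwise run the expensive computation once and save it for next time
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    result = compute()
    p.write_text(json.dumps(result))
    return result
```

On a second run the expensive step is skipped entirely, which is exactly what the provided checkpoint files do for chunking, embedding, and evaluation.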

If you can't or don't want to run this notebook you can also view the completed notebook by opening the cleantech_rag.html file in your browser.

Unveiling the Depths of RAG Pipelines

Throughout this notebook, we'll delve into the intricate workings of RAG pipelines.

Questions or Issues? We're Here to Help!

If you encounter any roadblocks or have questions, please don't hesitate to reach out to George Rowlands.

Setting your OpenAI Key

This OpenAI key is used for tasks such as generating answers with GPT-4o and evaluating the RAG pipeline with an LLM as a judge.

%%writefile .env

OPENAI_API_KEY=ENTER_HERE
Overwriting .env

After executing the above cell, you should restart the kernel/runtime to ensure the key is properly set.
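After restarting, a quick sanity check confirms the key was picked up and is no longer the placeholder. This is a sketch; `check_api_key` is our own helper, not part of the notebook's pipeline.

```python
import os

def check_api_key(name: str = "OPENAI_API_KEY") -> bool:
    # True only if the variable is set and no longer the ENTER_HERE placeholder
    value = os.environ.get(name, "")
    return bool(value) and value != "ENTER_HERE"
```

If this returns False, re-check the .env file and restart the kernel before continuing.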

Installing Dependencies

%%writefile requirements.txt

chromadb==0.5.0
datasets==2.19.1
gdown==5.2.0
kaggle==1.6.1
langchain==0.2.0
langchain-community==0.2.0
langchain-experimental==0.0.59
langchain-openai==0.1.7
langdetect==1.0.9
lorem-text==2.1
nbformat>=4.2.0
plotly==5.22.0
pretty-jupyter==1.0
ragas==0.1.8
seaborn==0.13.2
sentence-transformers==3.0.0
spacy>=3.7
textstat==0.7.3
umap-learn==0.5.5
Overwriting requirements.txt
%pip install torch==2.3.0 --quiet --index-url https://download.pytorch.org/whl/cu121
Note: you may need to restart the kernel to use updated packages.
%pip install -r ./requirements.txt --quiet
Note: you may need to restart the kernel to use updated packages.
import json
import os
import warnings
import zipfile
from collections import Counter
from pathlib import Path
from typing import Dict, List

import chromadb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import torch
from chromadb import Collection, Documents, EmbeddingFunction, Embeddings
from datasets import Dataset
from dotenv import load_dotenv
from langdetect import detect
from lorem_text import lorem
from ragas import RunConfig, evaluate
from ragas.metrics import (faithfulness, answer_relevancy, context_relevancy, answer_correctness)
from spacy.lang.en import English
from textstat import flesch_reading_ease
from tqdm import tqdm
import umap

from langchain.chains.base import Chain
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, VectorStore
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import LLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

load_dotenv()
warnings.filterwarnings("ignore")

By removing the first line (%%script echo skipping) from the two cells below, you can download the dataset, the chunks, the chunk embeddings, and the evaluation results from our Google Drive. This will save you time and money.

%%script echo skipping 
!gdown 1MoT_s_Zk4dzRRy7E7Va5ZuTROIOI1FfZ
Couldn't find program: 'echo'
%%script echo skipping 
with zipfile.ZipFile("data.zip", "r") as zip_file:
    zip_file.extractall()
Couldn't find program: 'echo'

Setting up our LLM

To make sure our OpenAI key is working, we will test it by generating a response from GPT-4o, which we will also use later in our RAG pipeline. Try some different prompts or questions to see how the model responds.

llm = ChatOpenAI(model="gpt-4o")
question_prompt = ChatPromptTemplate.from_template(
    "Answer the following question: {question}")
question_chain = question_prompt | llm | StrOutputParser()
question_chain.invoke({"question": "What is the meaning of life?"})
'The question about the meaning of life has been a central philosophical and existential inquiry for centuries, with various interpretations and answers depending on cultural, religious, philosophical, and individual perspectives. Here are a few approaches to consider:\n\n1. **Philosophical Perspective**: Many philosophers have explored this question. For instance, existentialists like Jean-Paul Sartre argue that life has no inherent meaning and it\'s up to individuals to create their own purpose.\n\n2. **Religious Perspective**: Different religions offer varied interpretations. For example, in Christianity, the meaning of life is often seen as living in accordance with God\'s will and seeking salvation. In Buddhism, it involves reaching enlightenment and escaping the cycle of rebirth.\n\n3. **Scientific Perspective**: From a scientific viewpoint, life can be seen as a process of evolution and survival. The "meaning" might be interpreted as the continuation and propagation of life through reproduction and adaptation.\n\n4. **Personal Perspective**: Many people find meaning through personal fulfillment, relationships, achievements, and contributing to the well-being of others. This is often subjective and varies greatly among individuals.\n\nUltimately, the meaning of life might be a combination of these perspectives, and it often depends on personal beliefs, values, and experiences.'

Downloading the Dataset from Kaggle

We will be exploring the Cleantech Media Dataset. If you have opened this notebook as recommended, by opening the provided GitHub repository in Google Colab, then you don't need to download the dataset; it should already be under data/bronze. If not, you can either manually download it into a data/bronze folder or follow the steps below.

Using the Kaggle API

We will be using the Kaggle API to download the data.

To use the Kaggle API you will need a Kaggle account. If you don't already have one, sign up for a Kaggle account at https://www.kaggle.com. When you are logged in, go to the 'Settings' tab of your user profile https://www.kaggle.com/settings and select 'Create New Token'. This will trigger the download of kaggle.json, a file containing your API credentials.

You can then add your Kaggle username and key from the kaggle.json.
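Instead of placing kaggle.json on disk, the credentials can also be exposed through the KAGGLE_USERNAME and KAGGLE_KEY environment variables, which the Kaggle CLI reads. The values below are placeholders; substitute the ones from your downloaded kaggle.json.

```python
import os

# Placeholders: substitute the username and key from your kaggle.json.
os.environ["KAGGLE_USERNAME"] = "your_kaggle_username"
os.environ["KAGGLE_KEY"] = "your_kaggle_key"
```

Setting these before invoking the kaggle command avoids writing credentials into the notebook's working directory.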

data_folder = Path("./data")
if not data_folder.exists():
    data_folder.mkdir()
bronze_folder = data_folder / "bronze"
if not bronze_folder.exists():
    bronze_folder.mkdir()
%%script echo skipping
kaggle_user = "XXXXXXXXXXXXXXXX"
kaggle_key = "XXXXXXXXXXXXXXXX"
Couldn't find program: 'echo'
%%script echo skipping
os.system(f"kaggle datasets download -d jannalipenkova/cleantech-media-dataset -p {bronze_folder}")
Couldn't find program: 'echo'
%%script echo skipping
with zipfile.ZipFile(bronze_folder / "cleantech-media-dataset.zip", "r") as zip_file:
    zip_file.extractall(bronze_folder)
Couldn't find program: 'echo'

Loading the Dataset into Dataframes

We now load and inspect both the Cleantech Media Dataset and the gold-standard evaluation data provided by our subject matter expert, Janna Lipenkova.

articles_df = pd.read_csv(
    bronze_folder / "cleantech_media_dataset_v2_2024-02-23.csv",
    encoding='utf-8', index_col=0)
articles_df.head()
title date author content domain url
1280 Qatar to Slash Emissions as LNG Expansion Adva... 2021-01-13 NaN ["Qatar Petroleum ( QP) is targeting aggressiv... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
1281 India Launches Its First 700 MW PHWR 2021-01-15 NaN ["• Nuclear Power Corp. of India Ltd. ( NPCIL)... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
1283 New Chapter for US-China Energy Trade 2021-01-20 NaN ["New US President Joe Biden took office this ... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
1284 Japan: Slow Restarts Cast Doubt on 2030 Energy... 2021-01-22 NaN ["The slow pace of Japanese reactor restarts c... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
1285 NYC Pension Funds to Divest Fossil Fuel Shares 2021-01-25 NaN ["Two of New York City's largest pension funds... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
human_eval_df = pd.read_csv(
    bronze_folder / "cleantech_rag_evaluation_data_2024-02-23.csv",
    encoding='utf-8', index_col=0)
human_eval_df.head()
question_id question relevant_chunk article_url
example_id
1 1 What is the innovation behind Leclanché's new ... Leclanché said it has developed an environment... https://www.sgvoice.net/strategy/technology/23...
2 2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... https://www.sgvoice.net/policy/25396/eu-seeks-...
3 2 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... https://www.pv-magazine.com/2023/02/02/europea...
4 3 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... https://www.sgvoice.net/policy/25396/eu-seeks-...
5 4 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... https://cleantechnica.com/2023/05/08/general-m...

Exploratory Data Analysis & Preprocessing

As the saying goes, "garbage in, garbage out." In the realm of machine learning, the quality of our outputs hinges on the quality of our inputs. This section delves into the essential processes of Exploratory Data Analysis (EDA) and data preprocessing. Through EDA, we'll illuminate the characteristics, patterns, and potential quirks residing within our cleantech news article dataset. Preprocessing will ensure our data is cleansed, structured, and prepared to be effectively utilized by the RAG pipeline, laying the foundation for high-quality results.

Let us start by gaining an overview of the dataset's features (columns).

articles_df.describe()
title date author content domain url
count 9593 9593 31 9593 9593 9593
unique 9569 967 7 9588 19 9593
top Cleantech Thought Leaders Series 2023-05-04 Michael Holder ['Geopolitics as much as price or quality will... cleantechnica https://www.energyintel.com/0000017b-a7dc-de4c...
freq 5 427 8 2 1861 1
articles_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9593 entries, 1280 to 81816
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    9593 non-null   object
 1   date     9593 non-null   object
 2   author   31 non-null     object
 3   content  9593 non-null   object
 4   domain   9593 non-null   object
 5   url      9593 non-null   object
dtypes: object(6)
memory usage: 524.6+ KB

Our initial exploration reveals that the "author" column only contains data for 31 out of 9593 articles. Since this offers minimal information gain, we can remove this feature.

We've also observed that some titles and content entries appear to be non-unique. This might necessitate identifying and removing duplicate entries.

On a positive note, the article URLs are all unique, potentially serving as suitable unique identifiers for the data.
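As a small illustration (toy data, not the real dataset): a column verified to be unique can be promoted to the index, giving each row a stable identifier.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the article data.
toy = pd.DataFrame({
    "url": ["https://a.example/1", "https://a.example/2"],
    "title": ["first", "second"],
})
assert toy["url"].is_unique  # the same check holds for the article URLs
toy = toy.set_index("url")   # now each row is addressable by its URL
```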

articles_df = articles_df.drop(columns=["author"])

Article DomainsΒΆ

The dataset helpfully provides the domain names extracted from the article URLs. These domains essentially represent the publishers of the news articles. Let's analyze the distribution of publishers and see how many articles each publisher has contributed.

domain_counts = articles_df["domain"].value_counts()
domain_counts
domain
cleantechnica            1861
azocleantech             1627
pv-magazine              1206
energyvoice              1017
solarindustrymag          673
naturalgasintel           658
thinkgeoenergy            645
rechargenews              559
solarpowerworldonline     505
energyintel               234
pv-tech                   232
businessgreen             158
greenprophet               80
ecofriend                  38
solarpowerportal.co        34
eurosolar                  28
decarbxpo                  19
solarquarter               17
indorenergy                 2
Name: count, dtype: int64

A visualization helps us to understand the skew in the data.

barplot = sns.barplot(
    x=domain_counts.values, 
    y=domain_counts.index,
    hue=domain_counts.index
)

barplot.set_title('Article Counts by Domain')
barplot.set_xlabel('Article Count')
barplot.set_ylabel('Domain')

plt.show()

Our exploration of article domains reveals a skewed distribution. Publishers like cleantechnica are heavily represented (1,861 articles), while others like indorenergy contribute almost nothing (2 articles). If we proceed with sampling this data, this imbalance should be taken into account; stratified sampling is the recommended approach to ensure a representative sample across publishers.
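Stratified sampling can be sketched with pandas in one line: sampling a fixed fraction per group keeps small publishers represented in proportion. The counts below are toy values, not the real domain distribution.

```python
import pandas as pd

# Toy frame: one dominant publisher and one tiny one.
toy = pd.DataFrame({
    "domain": ["cleantechnica"] * 90 + ["indorenergy"] * 10,
    "title": [f"article {i}" for i in range(100)],
})

# Take 10% of each domain's articles rather than 10% of the whole frame,
# so the small publisher is guaranteed its proportional share.
sample = toy.groupby("domain").sample(frac=0.1, random_state=42)
```

A plain `toy.sample(frac=0.1)` could by chance miss the small publisher entirely; the grouped version cannot.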

Article DatesΒΆ

Each article within the dataset is accompanied by a publication date. Let's delve into the temporal range of these articles and investigate any noteworthy patterns in publication trends.

# plot the amount of articles over time
articles_df["date"] = pd.to_datetime(articles_df["date"])
time_df = articles_df.groupby("date").size().reset_index()
time_df.columns = ["date","count"]

time_df.describe()
date count
count 967 967.000000
mean 2022-06-01 19:11:06.390899456 9.920372
min 2021-01-01 00:00:00 1.000000
25% 2021-09-11 12:00:00 4.000000
50% 2022-06-06 00:00:00 9.000000
75% 2023-02-14 12:00:00 13.000000
max 2023-12-05 00:00:00 427.000000
std NaN 15.206340
sns.lineplot(data=time_df, x="date", y="count")
plt.title("Article Count Over Time")
plt.xlabel("Date")
plt.xticks(rotation=90)
plt.ylabel("Article Count")
# add a line for the average
avg_count = time_df["count"].mean()
plt.axhline(avg_count, color='r', linestyle='--', label=f"Average article count per day: {avg_count:.2f}")
plt.legend()
plt.show()

While the daily article count appears consistent overall, a significant outlier disrupts the pattern on 2023-05-04, the most frequent date with 427 articles. The cause of this outlier is undetermined, but it could potentially be a scraping artifact or a default value assigned to articles with missing dates. Since the publication date is not crucial for the RAG pipeline, we can remove it.

articles_df = articles_df.drop(columns=["date"])

Article TitlesΒΆ

As noted in our initial exploration, some articles share identical titles. Here, we'll focus on identifying and handling these duplicate titles to ensure a clean and consistent dataset for our RAG pipeline.

sns.histplot(articles_df["title"].str.len())
plt.title("Title Length Distribution")
plt.xlabel("Title Length")
plt.ylabel("Count")
avg_count = articles_df["title"].str.len().mean()
plt.axvline(avg_count, color='r', linestyle='--', label=f"Average title length: {avg_count:.2f}")
plt.legend()
plt.show()
articles_df["title"].duplicated().sum()
24
duplicate_titles = articles_df[articles_df["title"].duplicated(keep=False)].sort_values("title")
duplicate_titles.head(10)
title content domain url
6654 Aberdeen’ s NZTC plans national centre for geo... ['Aberdeen’ s NZTC is planning a national cent... energyvoice https://www.energyvoice.com/renewables-energy-...
6660 Aberdeen’ s NZTC plans national centre for geo... ['Aberdeen’ s NZTC is planning a national cent... energyvoice https://sgvoice.energyvoice.com/strategy/techn...
38593 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cross
38599 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cro...
38596 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cro...
38598 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cro...
38597 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cro...
6704 BEIS mulls ringfenced CfD support for geotherm... ['Ministers are considering whether geothermal... energyvoice https://sgvoice.energyvoice.com/policy/21121/b...
6702 BEIS mulls ringfenced CfD support for geotherm... ['Ministers are considering whether geothermal... energyvoice https://www.energyvoice.com/renewables-energy-...
37040 Cleantech Insights from Industry Series ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/Insights.aspx?page=2
duplicate_titles["content"].duplicated().sum()
0

Our exploration identified 24 titles that appear multiple times in the dataset. Examples include "About David J. Cross." Interestingly, while the titles are identical, the content itself appears to be unique.

Let's print a couple of these duplicate articles in full for further investigation:

def wrap_text(text: str, char_per_line=100) -> str:
    # for better readability, wrap the text at the last space before char_per_line
    if len(text) < char_per_line:
        return text
    head = text[:char_per_line].rsplit(' ', 1)[0]
    return head + '\n' + wrap_text(text[len(head) + 1:], char_per_line)
print(duplicate_titles.iloc[0]["title"])
print(wrap_text(duplicate_titles.iloc[0]["content"]))
Aberdeen’ s NZTC plans national centre for geothermal energy
['Aberdeen’ s NZTC is planning a national centre to accelerate geothermal energy in the UK and
become the “ go-to ” hub globally for the renewable technology.', 'Calum Watson, senior project
engineer at the Net Zero Technology Centre, has outlined ambitions for the oil and gas industry to
help ramp up the clean energy, both onshore and in the North Sea.', 'The NZTC’ s new “ National
Geothermal Innovation Centre ” would develop technology and help create “ bespoke regulation ” for
geothermal, with the aim of it providing 5% of UK energy needs by 2030.', 'By 2050, Mr Watson said
geothermal could account for 20% of Britain’ s energy mix, slashing carbon emissions in the
process.', 'Geothermal is a burgeoning technology – which has been picked up in some countries like
Iceland and the Philippines – which harnesses heat in the subsurface of the earth to generate
electricity.', 'Some barriers to its uptake include expensive up-front costs like exploration and
drilling.', 'However a report published this week by trade body Offshore Energies UK said there are
2,100 offshore oil and gas wells to be decommissioned in the North Sea next decade – which Mr
Watson described as a “ massive opportunity ” for geothermal', 'Based at a “ north-east location ”,
the new hub would be the “ go to centre globally for geothermal technology challenges but,
crucially, would be world-leading in supporting government, and creating legislation and best
practice for geothermal ”.', 'Speaking at the Offshore Decommissioning Conference in St Andrews on
Tuesday, Mr Watson did not disclose whether the plan had backers or when it might be set up.', 'He
said it would be achieved through a “ partner-led roadmap ” akin to the NZTC itself – which is
funded with £180m of UK and Scottish Government funding – and ultimately be powered by geothermal
energy.', 'The national base would comprise a “ solution centre ” to scale up technologies from
pilot stage.', 'It would also have a knowledge hub to share learnings and an “ accelerator
programme ” to fund start-ups.', 'The NZTC has already dipped its toe into the tech – supporting a
“ first of its kind ” test project for the EnQuest Magnus platform in the North Sea.', 'Mr Watson
set out his hopes for what the centre could achieve by 2030, and highlighted the opportunity for
oil and gas workers to transfer to the sustainable technology.', '“ ( By 2030) we want the centre
to have delivered geothermal energy, accounting for 5% of the UK’ s energy mix and on route for 20%
by 2050.', '“ We would have multiple demonstrators successfully delivered to showcase and educate
and, long term, the center will be run on geothermal energy.']
print(duplicate_titles.iloc[1]["title"])
print(wrap_text(duplicate_titles.iloc[1]["content"]))
Aberdeen’ s NZTC plans national centre for geothermal energy
['Aberdeen’ s NZTC is planning a national centre to accelerate geothermal energy in the UK and
become the “ go-to ” hub globally for the renewable technology.', 'Calum Watson, senior project
engineer at the Net Zero Technology Centre, has outlined ambitions for the oil and gas industry to
help ramp up the clean energy, both onshore and in the North Sea.', 'The NZTC’ s new “ National
Geothermal Innovation Centre ” would develop technology and help create “ bespoke regulation ” for
geothermal, with the aim of it providing 5% of UK energy needs by 2030.', 'By 2050, Mr Watson said
geothermal could account for 20% of Britain’ s energy mix, slashing carbon emissions in the
process.', 'Geothermal is a burgeoning technology – which has been picked up in some countries like
Iceland and the Philippines – which harnesses heat in the subsurface of the earth to generate
electricity.', 'Some barriers to its uptake include expensive up-front costs like exploration and
drilling.', 'However a report published this week by trade body Offshore Energies UK said there are
2,100 offshore oil and gas wells to be decommissioned in the North Sea next decade – which Mr
Watson described as a “ massive opportunity ” for geothermal', 'Based at a “ north-east location ”,
the new hub would be the “ go to centre globally for geothermal technology challenges but,
crucially, would be world-leading in supporting government, and creating legislation and best
practice for geothermal ”.', 'Speaking at the Offshore Decommissioning Conference in St Andrews on
Tuesday, Mr Watson did not disclose whether the plan had backers or when it might be set up.', 'He
said it would be achieved through a “ partner-led roadmap ” akin to the NZTC itself – which is
funded with £180m of UK and Scottish Government funding – and ultimately be powered by geothermal
energy.', 'The national base would comprise a “ solution centre ” to scale up technologies from
pilot stage.', 'It would also have a knowledge hub to share learnings and an “ accelerator
programme ” to fund start-ups.', 'The NZTC has already dipped its toe into the tech – supporting a
“ first of its kind ” test project for the EnQuest Magnus platform in the North Sea.', 'Mr Watson
set out his hopes for what the centre could achieve by 2030, and highlighted the opportunity for
oil and gas workers to transfer to the sustainable technology.', '“ ( By 2030) we want the centre
to have delivered geothermal energy, accounting for 5% of the UK’ s energy mix and on route for 20%
by 2050.']

Our analysis suggests redundancy within certain articles: in some cases, the second article is simply the first article with an additional sentence appended at the end.

Let's take a closer look at how the contents of these "energyvoice" articles start and see if we can eliminate these redundancies.

energyvoice_articles = articles_df[articles_df["domain"].str.contains("energyvoice")]
energyvoice_articles.content.map(lambda x: x[:50]).value_counts()
content
['', '', 'The Megawatt Hour is the latest podcast     6
['A group of trade associations from across the en    3
['Two years after the Amazon Pledge Fund invested     3
['The latest analysis shows that capital flows tow    2
['Macquarie Group is betting the North Sea – engin    2
                                                     ..
['Now more than ever – in terms of cost and the im    1
['Scientists have hailed a helium discovery which     1
['Marine equipment fabrication and rental speciali    1
['The Russian powers behind oil explorers Exillon     1
['Aberdeen-headquartered Repsol Sinopec Resources     1
Name: count, Length: 980, dtype: int64
def remove_prefix_articles(df: pd.DataFrame, prefix_len: int = 100) -> pd.DataFrame:
    """
    Runs in O(n^2) time.
    If the first {prefix_len} characters of two articles match, we treat the
    shorter article as a prefix of the longer one.
    If an article is a prefix of a longer article with the same title, we remove
    it; if the titles differ, we keep both.
    """

    df["char_len"] = df["content"].map(len)
    df = df.sort_values(by='char_len', ascending=True).reset_index(drop=True)

    # Initialize a list to keep the articles that are not prefixes of others
    non_prefix_articles = []

    for i, row in df.iterrows():
        is_prefix = False
        content_i = row['content'][:prefix_len]
        title_i = row['title']

        for j in range(i + 1, len(df)):
            content_j = df.at[j, 'content'][:prefix_len]
            title_j = df.at[j, 'title']

            if content_i == content_j:
                # If the prefix matches but the titles are different, we keep it
                if title_i != title_j:
                    continue
                else:
                    is_prefix = True
                    break

        if not is_prefix:
            non_prefix_articles.append(row)

    print(f"Removed {len(df) - len(non_prefix_articles)} prefix articles")
    return pd.DataFrame(non_prefix_articles)
energyvoice_articles = remove_prefix_articles(energyvoice_articles)
energyvoice_articles.content.map(lambda x: x[:100]).value_counts()
Removed 11 prefix articles
content
['', '', 'The Megawatt Hour is the latest podcast boxset brought to you by Energy Voice Out Loud in     6
['Two years after the Amazon Pledge Fund invested in Hippo Harvest, the company is selling its first    3
['A group of trade associations from across the energy sector have written to the Chancellor urging     3
['Global Port Services has confirmed the award of multiple contracts in support of the Seagreen wind    2
['DNV report shows Jotun’ s Baltoflake solution offers beyond 30 years’ protection for offshore asse    2
                                                                                                       ..
['The deal volume for renewable energy assets in Asia more than tripled to $ 13.6 billion in 2021, a    1
['Several young energy professionals have undertaken a voyage across Scotland to spotlight the count    1
['A UK-backed research group unveiled a design for a liquid hydrogen-powered airliner theoretically     1
['UK-listed Pharos Energy is excited about its upcoming Vietnam activities with a 3D seismic shoot l    1
['With the greatest and most urgent energy transition in human history accelerating, the quest for n    1
Name: count, Length: 981, dtype: int64

There still seems to be some redundancy, but we did manage to remove 11 duplicate articles.
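The 100-character heuristic could be tightened with a full `startswith` comparison. Here is a sketch of such a check (our own helper, not part of the pipeline above):

```python
def true_prefix_pairs(contents):
    # return (shorter_idx, longer_idx) pairs where one text is a strict prefix
    # of another, i.e. the entire shorter text matches, not just 100 characters
    pairs = []
    for i, a in enumerate(contents):
        for j, b in enumerate(contents):
            if i != j and len(a) < len(b) and b.startswith(a):
                pairs.append((i, j))
    return pairs
```

This is stricter than the prefix_len heuristic, at the cost of comparing full article texts in the O(n^2) loop.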

Article ContentsΒΆ

Having explored various aspects of our dataset, we now turn our attention to the heart of the matter: the article content itself. This section will delve into the analysis and preprocessing techniques we'll employ to ensure the content is high-quality and effectively utilized by our RAG pipeline.

We start with a visual inspection of the article content.

np.random.seed(7)
random_sample_id = np.random.choice(articles_df.index)
print(wrap_text(articles_df.loc[random_sample_id, "content"]))
['Enphase Energy Inc., a supplier of microinverter-based solar and battery systems, says its
partner Lumio will be significantly expanding its offering of Enphase IQ8 microinverters and IQ
batteries to customers across the United States.', 'The strategic relationship with Lumio will
amplify the impact and distribution of Enphase systems, providing homeowners more access to
reliable, sustainable and grid-independent power sources, the company says.', '“ We are excited
about Enphase’ s full suite of products – including microinverters, batteries and EV chargers –
that can provide our customers best-in-class home energy management solutions, ” says Greg
Butterfield, CEO at Lumio. “ Additionally, the Enphase digital platform, from lead generation to
permitting to ongoing operations and maintenance services, offers a unique ability for Lumio to
increase efficiencies and reduce costs. ”', 'For homeowners who want battery backup, there are no
sizing restrictions on pairing Enphase IQ batteries with IQ8 microinverters, and the Sunlight Jump
Start feature can restart a home energy system – switching to sunlight-only after prolonged grid
outages that may result in a fully depleted battery. This eliminates the need for a manual restart
of the system and gives homeowners greater assurance of energy resilience.', '“ This strategic
relationship with Enphase makes it easier for Lumio’ s customers to take control of their power
production, power consumption, and increase the security and reliability of their family’ s power
supply, ” adds David Schonberg, senior vice president of energy partnerships at Lumio.', 'Solar
Industry offers industry participants probing, comprehensive assessments of the technology, tools
and trends that are driving this dynamic energy sector. From raw materials straight through to
end-user applications, we capture and analyze the critical details that help professionals stay
current and navigate the solar market.', '© Copyright Zackin Publications Inc. All Rights
Reserved.']

Our initial examination reveals that article content is currently stored as a list of strings. To gain deeper understanding and facilitate preprocessing, we'll transform these lists into a more cohesive textual format.

from ast import literal_eval  # safer than eval for parsing the stringified lists

articles_df['article'] = articles_df['content'].apply(lambda x: ' '.join(literal_eval(x)))
print(wrap_text(articles_df.loc[random_sample_id, "article"]))
Enphase Energy Inc., a supplier of microinverter-based solar and battery systems, says its partner
Lumio will be significantly expanding its offering of Enphase IQ8 microinverters and IQ batteries
to customers across the United States. The strategic relationship with Lumio will amplify the
impact and distribution of Enphase systems, providing homeowners more access to reliable,
sustainable and grid-independent power sources, the company says. “ We are excited about Enphase’ s
full suite of products – including microinverters, batteries and EV chargers – that can provide our
customers best-in-class home energy management solutions, ” says Greg Butterfield, CEO at Lumio. “
Additionally, the Enphase digital platform, from lead generation to permitting to ongoing
operations and maintenance services, offers a unique ability for Lumio to increase efficiencies and
reduce costs. ” For homeowners who want battery backup, there are no sizing restrictions on pairing
Enphase IQ batteries with IQ8 microinverters, and the Sunlight Jump Start feature can restart a
home energy system – switching to sunlight-only after prolonged grid outages that may result in a
fully depleted battery. This eliminates the need for a manual restart of the system and gives
homeowners greater assurance of energy resilience. “ This strategic relationship with Enphase makes
it easier for Lumio’ s customers to take control of their power production, power consumption, and
increase the security and reliability of their family’ s power supply, ” adds David Schonberg,
senior vice president of energy partnerships at Lumio. Solar Industry offers industry participants
probing, comprehensive assessments of the technology, tools and trends that are driving this
dynamic energy sector. From raw materials straight through to end-user applications, we capture and
analyze the critical details that help professionals stay current and navigate the solar market. ©
Copyright Zackin Publications Inc. All Rights Reserved.
articles_df["article"].duplicated().sum()
5
duplicate_articles = articles_df[articles_df["article"].duplicated(keep=False)].sort_values("article")
duplicate_articles
title content domain url article
78215 China's wind giants are chasing global growth:... ['Geopolitics as much as price or quality will... rechargenews https://www.rechargenews.com/wind/chinas-wind-... Geopolitics as much as price or quality will d...
78216 Why geopolitics will set the limits of China's... ['Geopolitics as much as price or quality will... rechargenews https://www.rechargenews.com/wind/why-geopolit... Geopolitics as much as price or quality will d...
80067 Sodium-ion battery production capacity to grow... ['Global demand for sodium-ion batteries is ex... pv-magazine https://www.pv-magazine.com/2023/07/17/sodium-... Global demand for sodium-ion batteries is expe...
80073 Sodium-ion battery fleet to grow to 10 GWh by ... ['Global demand for sodium-ion batteries is ex... pv-magazine https://www.pv-magazine.com/2023/07/17/sodium-... Global demand for sodium-ion batteries is expe...
6685 Indonesia seeks investors for giant geothermal... ['Indonesia, home to the world’ s largest geot... energyvoice https://www.energyvoice.com/oilandgas/467719/i... Indonesia, home to the world’ s largest geothe...
6689 Indonesia seeks investors for giant geothermal... ['Indonesia, home to the world’ s largest geot... energyvoice https://sgvoice.energyvoice.com/investing/2002... Indonesia, home to the world’ s largest geothe...
78225 Quest for endless green energy from Earth's co... ['One of Japan’ s largest utility groups Chubu... rechargenews https://www.rechargenews.com/energy-transition... One of Japan’ s largest utility groups Chubu E...
78227 Limitless green energy from Earth's core quest... ['One of Japan’ s largest utility groups Chubu... rechargenews https://www.rechargenews.com/news/2-1-1487279 One of Japan’ s largest utility groups Chubu E...
78210 Portugal energy transition plan targets massiv... ['Portugal has more than doubled its 2030 goal... rechargenews https://www.rechargenews.com/energy-transition... Portugal has more than doubled its 2030 goals ...
78212 Wind, hydrogen and solar fused in Portugal's p... ['Portugal has more than doubled its 2030 goal... rechargenews https://www.rechargenews.com/energy-transition... Portugal has more than doubled its 2030 goals ...

Our analysis uncovers further duplication: seemingly identical articles reposted on the same domain under different titles (beyond the "sgvoice.energyvoice.com" vs. "energyvoice.com" case addressed previously). We'll deliberately keep these duplicates whose contents match but whose titles differ.

Importance of Titles

We keep these duplicate articles because titles can hold information relevant to our RAG pipeline. Consider a query that uses an abbreviation which appears only in an article's title, while the content consistently spells out the full term. To bridge this gap, we'll prepend titles to the article content during preprocessing. This ensures that the retrieval process considers not only the content itself, but also the potentially informative titles.
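As a minimal sketch of that preprocessing step, on a hypothetical two-column frame standing in for the real `articles_df`:

```python
import pandas as pd

# toy frame standing in for articles_df (hypothetical data)
df = pd.DataFrame({
    "title": ["EU unveils Green Deal Industrial Plan (GDIP)"],
    "article": ["The plan is a bid by the bloc to boost cleantech manufacturing."],
})

# prepend the title so retrieval can match terms that only appear in headlines
df["article"] = df["title"] + ". " + df["article"]
print(df["article"].iloc[0])
```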

Next Step

As previously noted, some articles exhibit standardized introductions, possibly artifacts of the data scraping process. We'll develop appropriate techniques to handle these introductions during preprocessing, ensuring they don't hinder the effectiveness of our RAG pipeline.

articles_df.article.map(lambda x: x[:50]).value_counts()
article
By clicking `` Allow All '' you agree to the stori    1627
Sign in to get the best natural gas news and data.     658
window.dojoRequire ( [ `` mojo/signup-forms/Loader      52
None of these red flags by themselves make a compa      19
Volkswagen ID.4 sales were up 254% in the 1st quar      14
                                                      ... 
You want to invest in renewable energy or a better       1
The best way to deal with carbon is not to release       1
When there is deflation, the prices of goods in th       1
Stickers are excellent products to leverage in bot       1
Arevon Energy Inc. has closed financing on the Vik       1
Name: count, Length: 6765, dtype: int64
artifacts = [
    "By clicking `` Allow All '' you agree to the sto",
    "Sign in to get the best natural gas news and dat",
    "window.dojoRequire ( [ `` mojo/signup-forms/Load"
]

for artifact in artifacts:
    print(wrap_text(articles_df[articles_df.article.str.startswith(artifact)].article.iloc[0][:500]))
    print()
By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site
navigation, analyse site usage and support us in providing free open access scientific content.
More info. Nel Hydrogen is committed to pushing the boundaries of science and continues to support
the research and development of new and innovative technologies. A group of leading researchers and
two employees of Proton Energy Systems, Inc., a subsidiary of Nel ASA ( Nel Hydrogen) have recently
published 

Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily
emails. Your email address * Your password * Remember me Continue Reset password Featured Content
News & Data Services Client Support Bidweek Markets | Natural Gas Prices | NGI All News Access
Major fluctuations in the latest weather models resulted in big swings in natural gas bidweek
prices, with solid gains on the East Coast and out West. However, much of the country’ s midsection
posted hefty 

window.dojoRequire ( [ `` mojo/signup-forms/Loader '' ], function ( L) { L.start ( { `` baseUrl '':
'' mc.us4.list-manage.com '', '' uuid '': '' 2a6df7ce0f3230ba1f5efe12c '', '' lid '': '' 1e23cc3ebd
'', '' uniqueMethods '': true }) }) American consumers are more concerned about the planet than
steady economic growth, new report. Your company wants to be a part of this. What steps do you
take? Each company should create detailed reports that evaluate the environmental impact of the
business, num

def remove_scraping_artifacts(df: pd.DataFrame, column: str) -> pd.DataFrame:
    text_artifacts = [
        "By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site navigation, analyse site usage and support us in providing free open access scientific content. More info.",
        "Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily emails. Your email address * Your password * Remember me Continue Reset password Featured Content News & Data Services Client Support"
    ]

    regex_artifacts = [
        r"window\.dojoRequire \( \[ .*\}\) \}\) "
    ]

    # operate on the frame passed in, not the global articles_df
    for pattern in text_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=False)

    for pattern in regex_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=True)

    return df
articles_df = remove_scraping_artifacts(articles_df, "article")
articles_df.article.map(lambda x: x[:50]).value_counts()
article
 Daily GPI Energy Transition | Infrastructure | NG    38
 Daily GPI E & P | NGI All News Access The U.S. na    36
 Daily GPI Energy Transition | NGI All News Access    28
None of these red flags by themselves make a compa    19
 Daily GPI Markets | Natural Gas Prices | NGI All     17
                                                      ..
 Award winning cleantech firm Aceleron’ s repairab     1
 Generating safe, green energy is one thing but pr     1
 Countries around the world need to move further a     1
 The sun is arguably the most important renewable      1
Arevon Energy Inc. has closed financing on the Vik     1
Name: count, Length: 8749, dtype: int64

Our efforts have eliminated a substantial portion of the scraping artifacts within the articles. Some traces still persist, likely remnants of website navigation structures; removing them would offer further refinement but presents a significant challenge for modest gain. We'll therefore acknowledge this for now and move on to further preprocessing, such as filtering out articles that are not written in English.

articles_df["lang"] = articles_df["article"].map(detect)
articles_df["lang"].value_counts()
lang
en    9588
de       4
ru       1
Name: count, dtype: int64

Let's first inspect the articles that were flagged as non-English.

articles_df[articles_df["lang"] != "en"]
title content domain url article lang
8283 International Energy Storage Conference ( IRES... ['EUROSOLAR veranstaltet vom 16. bis 18. MΓ€rz ... eurosolar https://www.eurosolar.de/2021/01/26/internatio... EUROSOLAR veranstaltet vom 16. bis 18. MΓ€rz 20... de
8304 Open Letter to Presidents Putin, Biden, Zelens... ['EUROSOLAR, the European Association for Rene... eurosolar https://www.eurosolar.de/sektionen/russland/ EUROSOLAR, the European Association for Renewa... ru
8307 Internationale Konferenz fΓΌr Energiespeicher m... ['Die nun zu Ende gegangene β€ž Internationale E... eurosolar https://www.eurosolar.de/2022/09/26/internatio... Die nun zu Ende gegangene β€ž Internationale Ern... de
8308 Presentations, Poster and Photos of the IRES 2022 ['Photos from the IRES ( Copyright EUROSOLAR e... eurosolar https://www.eurosolar.de/2022/10/20/presentati... Photos from the IRES ( Copyright EUROSOLAR e.V... de
24652 SMS group liefert Prozesstechnologie fΓΌr das e... ['Β© SMS group liefert Prozesstechnologie fΓΌr d... decarbxpo https://www.decarbxpo.com/en/News_Media/Magazi... Β© SMS group liefert Prozesstechnologie fΓΌr das... de
print(wrap_text(articles_df[articles_df["lang"] != "en"].iloc[1]["article"][1000:]))
 suffering and misery for over a century, while distracting from the one common enemy threatening
to consume all: accelerated fossil fueled climate heating. The Ukraine’ s EUROSOLAR section and its
networks have long advocated a new age with renewable energy in Eastern Europe. Together with all
of our other sections and members across the European continent, from Russia to the Netherlands,
and from Turkey to Denmark, EUROSOLAR offers this Climate Peace Platform. Prof. Peter Droege,
President of EUROSOLAR: β€œ The time has come for Climate Peace Diplomacy, to confront everyone’ s
common enemy: advanced fossil climate destabilization. This is one of ten actions presented by
EUROSOLAR as the main agenda of our time. ” Dr. Brigitte Schmidt, Vice President and Board Member
of EUROSOLAR Germany: β€˜ The time for renewable peace has come, part of our Regenerative Earth
Decade program. It stands for rethinking and peaceful action for our common future.’ Since its very
foundation in 1988 EUROSOLAR has worked to end fossil fuel wars through the great switch to 100%
renewable energy. In the words of Hermann Scheer ( 1944-2010), founder of EUROSOLAR: β€˜ Renewable
energies build peace’. The age of fossil-nuclear threats must end, the existential focus must
begin: www.earthdecade.org. EUROSOLAR also calls for a shift in thinking towards climate peace
diplomacy that recognizes and combats fossil dependencies as humanity’ s greatest common enemy.
https:
//www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom
cy/ Π’Ρ–Π΄ΠΊΡ€ΠΈΡ‚ΠΈΠΉ лист ΠΏΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚Π°ΠΌ ΠŸΡƒΡ‚Ρ–Π½Ρƒ, Π‘Π°ΠΉΠ΄Π΅Π½, Π—Π΅Π»Π΅Π½ΡΡŒΠΊΠΈΠΉ Ρ– Π›ΡƒΠΊΠ°ΡˆΠ΅Π½ΠΊΠΎ: Eurosolar, Π„Π²Ρ€ΠΎΠΏΠ΅ΠΉΡΡŒΠΊΠ°
асоціація Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½ΠΎΡ— Π΅Π½Π΅Ρ€Π³Π΅Ρ‚ΠΈΠΊΠΈ, Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ” Π΄ΠΎ Π½Π΅Π³Π°ΠΉΠ½ΠΎΠ³ΠΎ припинСння вогню Ρ‚Π° постійної ΠΌΠΈΡ€Π½ΠΎΡ—
ΡƒΠ³ΠΎΠ΄ΠΈ ΠΏΠΎ всій Π‘Ρ…Ρ–Π΄Π½Ρ–ΠΉ Π„Π²Ρ€ΠΎΠΏΡ–, Π±Π΅Ρ€ΡƒΡ‡ΠΈ ΡƒΡ‡Π°ΡΡ‚ΡŒ Ρƒ всСсторонній ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½Ρ–ΠΉ ΠΌΠΈΡ€Π½Ρ–ΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚Ρ–Ρ—. Напад
Ρ€ΠΎΡΡ–ΠΉΡΡŒΠΊΠΈΡ… Π²Ρ–ΠΉΡΡŒΠΊΠΎΠ²ΠΈΡ… Π½Π° ΡƒΠΊΡ€Π°Ρ—Π½ΡΡŒΠΊΠΈΠΉ Π½Π°Ρ€ΠΎΠ΄ Ρ– ΠΉΠΎΠ³ΠΎ уряд ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ Π±ΡƒΡ‚ΠΈ засудТСний Π½Π°ΠΉΡ€Ρ–ΡˆΡƒΡ‡Ρ–ΡˆΠΈΠΌ Ρ‡ΠΈΠ½ΠΎΠΌ Ρ–
ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ Π½Π΅Π³Π°ΠΉΠ½ΠΎ припинитися. Всі ΠΊΡ€Π°Ρ—Π½ΠΈ, які Π²ΠΈΠΊΠΎΡ€ΠΈΡΡ‚ΠΎΠ²ΡƒΡŽΡ‚ΡŒ Π²Ρ–ΠΉΡΡŒΠΊΠΎΠ²Ρ– альянси для постійного
коригування сфСр інтСрСсів Ρ– постійно ТокСя для Ρ‚Π°ΠΊΡ‚ΠΈΡ‡Π½ΠΈΡ… Ρ– стратСгічних ΠΏΠ΅Ρ€Π΅Π²Π°Π³, ΠΏΠΎΠ²ΠΈΠ½Π½Ρ– ΠΏΡ€ΠΈΠΏΠΈΠ½ΠΈΡ‚ΠΈ
свою Π΄Π΅ΡΡ‚Π°Π±Ρ–Π»Ρ–Π·ΡƒΡŽΡ‡Ρƒ ΠΏΡ€Π°ΠΊΡ‚ΠΈΠΊΡƒ. Всі сторони ΠΏΠΎΠ²ΠΈΠ½Π½Ρ– прокинутися: ΠΌΠΈ Π½Π΅ Ρ‚Ρ–Π»ΡŒΠΊΠΈ всі дивлямося Π² ядСрну
ΠΏΡ€Ρ–Ρ€Π²Ρƒ Ρ‡Π΅Ρ€Π΅Π· Ρ‚Ρ€ΠΈΠ²Π°Π»Ρ– Π½Π΅Π²Π΄Π°Π»Ρ– спроби роззброєння – ΠΏΠ»Π°Π½Π΅Ρ‚Π° Ρ‚Π°ΠΊΠΎΠΆ Π·Π½Π°Ρ…ΠΎΠ΄ΠΈΡ‚ΡŒΡΡ Π² Π»Π΅Ρ‰Π°Ρ‚Π°Ρ…
Π½Π΅ΠΊΠΎΠ½Ρ‚Ρ€ΠΎΠ»ΡŒΠΎΠ²Π°Π½ΠΎΡ— ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½ΠΎΡ— спіралі, яка ΠΏΡ€Π°ΠΊΡ‚ΠΈΡ‡Π½ΠΎ Π½Π°ΠΏΠ΅Π²Π½ΠΎ Π·Ρ€ΠΎΠ±ΠΈΡ‚ΡŒ Ρ—Ρ— Π½Π΅ΠΏΡ€ΠΈΠ΄Π°Ρ‚Π½ΠΎΡŽ для Тиття Π²
Ρ†ΡŒΠΎΠΌΡƒ ΠΏΠΎΠΊΠΎΠ»Ρ–Π½Π½Ρ–. Eurosolar, Π„Π²Ρ€ΠΎΠΏΠ΅ΠΉΡΡŒΠΊΠ° асоціація Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½ΠΎΡ— Π΅Π½Π΅Ρ€Π³Π΅Ρ‚ΠΈΠΊΠΈ, Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ” Π΄ΠΎ ΠΏΠΎΠ²Π½ΠΎΠ³ΠΎ Ρ–
швидкого ΠΏΠ΅Ρ€Π΅Ρ…ΠΎΠ΄Ρƒ Π΄ΠΎ Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½ΠΎΡ— Π΅Π½Π΅Ρ€Π³Π΅Ρ‚ΠΈΠΊΠΈ, Ρ‰ΠΎΠ± покласти ΠΊΡ€Π°ΠΉ залСТності Π„Π²Ρ€ΠΎΠΏΠΈ Ρ‚Π° світу Π²Ρ–Π΄
Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠ³ΠΎ ΠΏΠ°Π»ΠΈΠ²Π°. Π¦Π΅ ΠΏΡ€ΠΈΠ·Π²Π΅Π»ΠΎ Π΄ΠΎ нСскінчСнної Π²Ρ–ΠΉΠ½ΠΈ, Π½Π΅Π²ΠΈΠΌΠΎΠ²Π½ΠΈΡ… ΡΡ‚Ρ€Π°ΠΆΠ΄Π°Π½ΡŒ Ρ– ΡΡ‚Ρ€Π°ΠΆΠ΄Π°Π½ΡŒ протягом
Π±Ρ–Π»ΡŒΡˆ Π½Ρ–ΠΆ століття, Π²Ρ–Π΄Π²ΠΎΠ»Ρ–ΠΊΠ°ΡŽΡ‡ΠΈ Π²Ρ–Π΄ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΡΠΏΡ–Π»ΡŒΠ½ΠΎΠ³ΠΎ Π²ΠΎΡ€ΠΎΠ³Π°, який ΠΏΠΎΠ³Ρ€ΠΎΠΆΡƒΡ” споТивати всС:
прискорСнС нагрівання ΠΊΠ»Ρ–ΠΌΠ°Ρ‚Ρƒ Π½Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠΌΡƒ ΠΏΠ°Π»ΠΈΠ²Ρ–. Π£ΠΊΡ€Π°Ρ—Π½ΡΡŒΠΊΠ° сСкція EUROSOLAR Ρ‚Π° Ρ—Ρ— ΠΌΠ΅Ρ€Π΅ΠΆΡ– Π²ΠΆΠ΅
Π΄Π°Π²Π½ΠΎ Π²ΠΈΡΡ‚ΡƒΠΏΠ°ΡŽΡ‚ΡŒ Π·Π° Π½ΠΎΠ²Ρƒ Π΅ΠΏΠΎΡ…Ρƒ Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½ΠΎΡ— Π΅Π½Π΅Ρ€Π³Π΅Ρ‚ΠΈΠΊΠΈ Ρƒ Π‘Ρ…Ρ–Π΄Π½Ρ–ΠΉ Π„Π²Ρ€ΠΎΠΏΡ–. Π Π°Π·ΠΎΠΌ Π· усіма Ρ–Π½ΡˆΠΈΠΌΠΈ
нашими сСкціями Ρ‚Π° Ρ‡Π»Π΅Π½Π°ΠΌΠΈ Π½Π° Ρ”Π²Ρ€ΠΎΠΏΠ΅ΠΉΡΡŒΠΊΠΎΠΌΡƒ ΠΊΠΎΠ½Ρ‚ΠΈΠ½Π΅Π½Ρ‚Ρ–, Π²Ρ–Π΄ Росії Π΄ΠΎ НідСрландів, Π° Ρ‚Π°ΠΊΠΎΠΆ Π²Ρ–Π΄
Π’ΡƒΡ€Π΅Ρ‡Ρ‡ΠΈΠ½ΠΈ Π΄ΠΎ Π”Π°Π½Ρ–Ρ—, EUROSOLAR ΠΏΡ€ΠΎΠΏΠΎΠ½ΡƒΡ” Ρ†ΡŽ ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½Ρƒ ΠΌΠΈΡ€Π½Ρƒ ΠΏΠ»Π°Ρ‚Ρ„ΠΎΡ€ΠΌΡƒ. ΠŸΡ€ΠΎΡ„. ΠŸΡ–Ρ‚Π΅Ρ€ Π”Ρ€ΠΎΡƒΠ΄ΠΆ, ΠŸΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚
EUROSOLAR: β€ž Настав час для ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½ΠΎΡ— ΠΌΠΈΡ€Π½ΠΎΡ— Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚Ρ–Ρ—, Ρ‰ΠΎΠ± протистояти ΡΠΏΡ–Π»ΡŒΠ½ΠΎΠΌΡƒ Π²ΠΎΡ€ΠΎΠ³Ρƒ
ΠΊΠΎΠΆΠ½ΠΎΠ³ΠΎ: ΠΏΠ΅Ρ€Π΅Π΄ΠΎΠ²Ρ–ΠΉ дСстабілізації Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠ³ΠΎ ΠΊΠ»Ρ–ΠΌΠ°Ρ‚Ρƒ. Π¦Π΅ ΠΎΠ΄Π½Π° Π· дСсяти Π΄Ρ–ΠΉ, прСдставлСних EUROSOLAR
як основний порядок Π΄Π΅Π½Π½ΠΈΠΉ нашого часу. β€œ Π— ΠΌΠΎΠΌΠ΅Π½Ρ‚Ρƒ свого заснування Π² 1988 Ρ€ΠΎΡ†Ρ– EUROSOLAR ΠΏΡ€Π°Ρ†ΡŽΠ²Π°Π²
Π½Π°Π΄ припинСнням Π²Ρ–ΠΉΠ½ΠΈ Π½Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠΌΡƒ ΠΏΠ°Π»ΠΈΠ²Ρ– ΡˆΠ»ΡΡ…ΠΎΠΌ Π²Π΅Π»ΠΈΠΊΠΎΠ³ΠΎ ΠΏΠ΅Ρ€Π΅Ρ…ΠΎΠ΄Ρƒ Π½Π° 100% Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½Ρƒ Π΅Π½Π΅Ρ€Π³Ρ–ΡŽ. Π—Π°
словами Π“Π΅Ρ€ΠΌΠ°Π½Π° Π¨ΠΈΡ€Π° ( 1944-2010), засновника EUROSOLAR: Β« Π’Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½Ρ– Π΄ΠΆΠ΅Ρ€Π΅Π»Π° Π΅Π½Π΅Ρ€Π³Ρ–Ρ— ΡΡ‚Π²ΠΎΡ€ΡŽΡŽΡ‚ΡŒ
ΠΌΠΈΡ€ Β». Π•ΠΏΠΎΡ…Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎ-ядСрних Π·Π°Π³Ρ€ΠΎΠ· ΠΏΠΎΠ²ΠΈΠ½Π½Π° закінчитися, ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ початися Π΅ΠΊΠ·ΠΈΡΡ‚Π΅Π½Ρ†Ρ–Π°Π»ΡŒΠ½ΠΈΠΉ фокус:
www.earthdecade.org. EUROSOLAR Ρ‚Π°ΠΊΠΎΠΆ Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ” Π΄ΠΎ Π·ΠΌΡ–Π½ΠΈ мислСння Π² Π±Ρ–ΠΊ ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½ΠΎΡ— ΠΌΠΈΡ€Π½ΠΎΡ—
Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚Ρ–Ρ—, яка Π²ΠΈΠ·Π½Π°Ρ” Ρ– Π±ΠΎΡ€Π΅Ρ‚ΡŒΡΡ Π· Π²ΠΈΠΊΠΎΠΏΠ½ΠΈΠΌΠΈ залСТностями як Π½Π°ΠΉΠ±Ρ–Π»ΡŒΡˆΠΈΠΉ ΡΠΏΡ–Π»ΡŒΠ½ΠΈΠΉ Π²ΠΎΡ€ΠΎΠ³ Π»ΡŽΠ΄ΡΡ‚Π²Π°.
https:
//www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom
cy/ ΠžΡ‚ΠΊΡ€Ρ‹Ρ‚ΠΎΠ΅ письмо ΠΏΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚Π°ΠΌ ΠŸΡƒΡ‚ΠΈΠ½Ρƒ, Π‘Π°ΠΉΠ΄Π΅Π½Ρƒ, ЗСлСнскому ΠΈ Π›ΡƒΠΊΠ°ΡˆΠ΅Π½ΠΊΠΎ: EUROSOLAR, ЕвропСйская
ассоциация возобновляСмой энСргСтики, ΠΏΡ€ΠΈΠ·Ρ‹Π²Π°Π΅Ρ‚ ΠΊ Π½Π΅ΠΌΠ΅Π΄Π»Π΅Π½Π½ΠΎΠΌΡƒ ΠΏΡ€Π΅ΠΊΡ€Π°Ρ‰Π΅Π½ΠΈΡŽ климатичСского огня ΠΈ
Π·Π°ΠΊΠ»ΡŽΡ‡Π΅Π½ΠΈΡŽ постоянного климатичСского ΠΌΠΈΡ€Π½ΠΎΠ³ΠΎ соглашСния ΠΏΠΎ всСй Восточной Π•Π²Ρ€ΠΎΠΏΠ΅ – ΠΈ, Ρ‚Π°ΠΊΠΈΠΌ
ΠΎΠ±Ρ€Π°Π·ΠΎΠΌ, ΠΊ Π½Π°Ρ‡Π°Π»Ρƒ многостороннСй климатичСской ΠΌΠΈΡ€Π½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚ΠΈΠΈ. НападСниС российских Π²ΠΎΠ΅Π½Π½Ρ‹Ρ… Π½Π°
украинский Π½Π°Ρ€ΠΎΠ΄ ΠΈ Π΅Π³ΠΎ ΠΏΡ€Π°Π²ΠΈΡ‚Π΅Π»ΡŒΡΡ‚Π²ΠΎ Π΄ΠΎΠ»ΠΆΠ½ΠΎ Π±Ρ‹Ρ‚ΡŒ осуТдСно самым Ρ€Π΅ΡˆΠΈΡ‚Π΅Π»ΡŒΠ½Ρ‹ΠΌ ΠΎΠ±Ρ€Π°Π·ΠΎΠΌ ΠΈ Π½Π΅ΠΌΠ΅Π΄Π»Π΅Π½Π½ΠΎ
остановлСно. ВсС страны, ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Π΅ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΡƒΡŽΡ‚ Π²ΠΎΠ΅Π½Π½Ρ‹Π΅ ΡΠΎΡŽΠ·Ρ‹ для постоянной ΠΊΠΎΡ€Ρ€Π΅ΠΊΡ‚ΠΈΡ€ΠΎΠ²ΠΊΠΈ своих сфСр
интСрСсов ΠΈ постоянной Π±ΠΎΡ€ΡŒΠ±Ρ‹ Π·Π° тактичСскоС ΠΈ стратСгичСскоС прСимущСство, Π΄ΠΎΠ»ΠΆΠ½Ρ‹ ΠΏΡ€Π΅ΠΊΡ€Π°Ρ‚ΠΈΡ‚ΡŒ свою
Π΄Π΅ΡΡ‚Π°Π±ΠΈΠ»ΠΈΠ·ΠΈΡ€ΡƒΡŽΡ‰ΡƒΡŽ ΠΏΡ€Π°ΠΊΡ‚ΠΈΠΊΡƒ. ВсС Π²ΠΎΠ²Π»Π΅Ρ‡Π΅Π½Π½Ρ‹Π΅ стороны Π΄ΠΎΠ»ΠΆΠ½Ρ‹ ΠΏΡ€ΠΎΡΠ½ΡƒΡ‚ΡŒΡΡ: Мало Ρ‚ΠΎΠ³ΠΎ, Ρ‡Ρ‚ΠΎ ΠΌΡ‹ всС
смотрим Π² ΡΠ΄Π΅Ρ€Π½ΡƒΡŽ Π±Π΅Π·Π΄Π½Ρƒ ΠΈΠ·-Π·Π° Π΄Π»ΠΈΡ‚Π΅Π»ΡŒΠ½Ρ‹Ρ… Π½Π΅ΡƒΠ΄Π°Ρ‡Π½Ρ‹Ρ… ΠΏΠΎΠΏΡ‹Ρ‚ΠΎΠΊ разоруТСния – ΠΏΠ»Π°Π½Π΅Ρ‚Π° Ρ‚Π°ΠΊΠΆΠ΅ находится Π²
Π½Π΅ΠΊΠΎΠ½Ρ‚Ρ€ΠΎΠ»ΠΈΡ€ΡƒΠ΅ΠΌΠΎΠΉ климатичСской спирали, которая ΠΏΠΎΡ‡Ρ‚ΠΈ навСрняка сдСлаСт Π΅Π΅ Π½Π΅ΠΏΡ€ΠΈΠ³ΠΎΠ΄Π½ΠΎΠΉ для ΠΆΠΈΠ·Π½ΠΈ
ΡƒΠΆΠ΅ Π² этом ΠΏΠΎΠΊΠΎΠ»Π΅Π½ΠΈΠΈ. EUROSOLAR, ЕвропСйская ассоциация возобновляСмых источников энСргии,
ΠΏΡ€ΠΈΠ·Ρ‹Π²Π°Π΅Ρ‚ ΠΊ ΠΏΠΎΠ»Π½ΠΎΠΌΡƒ ΠΈ быстрому ΠΏΠ΅Ρ€Π΅Ρ…ΠΎΠ΄Ρƒ Π½Π° возобновляСмыС источники энСргии, Ρ‡Ρ‚ΠΎΠ±Ρ‹ ΠΏΠΎΠ»ΠΎΠΆΠΈΡ‚ΡŒ ΠΊΠΎΠ½Π΅Ρ†
зависимости Π•Π²Ρ€ΠΎΠΏΡ‹ ΠΈ всСго ΠΌΠΈΡ€Π° ΠΎΡ‚ ископаСмого Ρ‚ΠΎΠΏΠ»ΠΈΠ²Π°. Она ΠΏΡ€ΠΈΠ²Π΅Π»Π° ΠΊ бСсконСчным Π²ΠΎΠΉΠ½Π°ΠΌ,
Π½Π΅Π²Ρ‹Ρ€Π°Π·ΠΈΠΌΡ‹ΠΌ страданиям ΠΈ Π½Π΅ΡΡ‡Π°ΡΡ‚ΡŒΡΠΌ Π½Π° протяТСнии Π±ΠΎΠ»Π΅Π΅ Π²Π΅ΠΊΠ°, отвлСкая нас ΠΎΡ‚ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΠΎΠ±Ρ‰Π΅Π³ΠΎ Π²Ρ€Π°Π³Π°,
ΠΊΠΎΡ‚ΠΎΡ€Ρ‹ΠΉ ΡƒΠ³Ρ€ΠΎΠΆΠ°Π΅Ρ‚ ΠΏΠΎΠ³Π»ΠΎΡ‚ΠΈΡ‚ΡŒ всСх нас: ускорСнного глобального потСплСния, Π²Ρ‹Π·Π²Π°Π½Π½ΠΎΠ³ΠΎ ископаСмым
Ρ‚ΠΎΠΏΠ»ΠΈΠ²ΠΎΠΌ. Украинская сСкция EUROSOLAR ΠΈ Π΅Π΅ сСти Π΄Π°Π²Π½ΠΎ Π²Ρ‹ΡΡ‚ΡƒΠΏΠ°ΡŽΡ‚ Π·Π° Π½ΠΎΠ²ΡƒΡŽ эру с возобновляСмыми
источниками энСргии Π² Восточной Π•Π²Ρ€ΠΎΠΏΠ΅. ВмСстС со всСми Π΄Ρ€ΡƒΠ³ΠΈΠΌΠΈ нашими сСкциями ΠΈ Ρ‡Π»Π΅Π½Π°ΠΌΠΈ ΠΏΠΎ всСму
СвропСйскому ΠΊΠΎΠ½Ρ‚ΠΈΠ½Π΅Π½Ρ‚Ρƒ, ΠΎΡ‚ России Π΄ΠΎ НидСрландов ΠΈ ΠΎΡ‚ Π’ΡƒΡ€Ρ†ΠΈΠΈ Π΄ΠΎ Π”Π°Π½ΠΈΠΈ, EUROSOLAR ΠΏΡ€Π΅Π΄Π»Π°Π³Π°Π΅Ρ‚ эту
ΠΏΠ»Π°Ρ‚Ρ„ΠΎΡ€ΠΌΡƒ ΠΌΠΈΡ€Π° ΠΊΠ»ΠΈΠΌΠ°Ρ‚Ρƒ. ΠŸΡ€ΠΎΡ„Π΅ΡΡΠΎΡ€ ΠŸΠ΅Ρ‚Π΅Ρ€ Π”Ρ€ΠΎΠ³Π΅, ΠΏΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚ EUROSOLAR: β€ž Настало врСмя для
климатичСской ΠΌΠΈΡ€Π½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚ΠΈΠΈ, Ρ‡Ρ‚ΠΎΠ±Ρ‹ ΠΏΡ€ΠΎΡ‚ΠΈΠ²ΠΎΡΡ‚ΠΎΡΡ‚ΡŒ ΠΎΠ±Ρ‰Π΅ΠΌΡƒ для всСх Π²Ρ€Π°Π³Ρƒ: дСстабилизации ΠΊΠ»ΠΈΠΌΠ°Ρ‚Π°
Π·Π° счСт ΠΏΠ΅Ρ€Π΅Π΄ΠΎΠ²ΠΎΠ³ΠΎ ископаСмого Ρ‚ΠΎΠΏΠ»ΠΈΠ²Π°. Π­Ρ‚ΠΎ ΠΎΠ΄Π½ΠΎ ΠΈΠ· дСсяти дСйствий, ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Π΅ EUROSOLAR прСдставляСт
ΠΊΠ°ΠΊ ΡΠ°ΠΌΡƒΡŽ Π²Π°ΠΆΠ½ΡƒΡŽ повСстку дня нашСго Π²Ρ€Π΅ΠΌΠ΅Π½ΠΈ β€œ. Π”ΠΎΠΊΡ‚ΠΎΡ€ Π‘Ρ€ΠΈΠ³ΠΈΡ‚Ρ‚Π΅ Π¨ΠΌΠΈΠ΄Ρ‚, Π²ΠΈΡ†Π΅-ΠΏΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚ ΠΈ Ρ‡Π»Π΅Π½
правлСния EUROSOLAR ГСрмания: β€ž Наступило врСмя возобновляСмого ΠΌΠΈΡ€Π°, Ρ‡Π°ΡΡ‚ΡŒ нашСй ΠΏΡ€ΠΎΠ³Ρ€Π°ΠΌΠΌΡ‹ β€ž
ВозобновляСмоС дСсятилСтиС β€œ. Он выступаСт Π·Π° пСрСосмыслСниС ΠΈ ΠΌΠΈΡ€Π½Ρ‹Π΅ дСйствия Π²ΠΎ имя нашСго ΠΎΠ±Ρ‰Π΅Π³ΠΎ
Π±ΡƒΠ΄ΡƒΡ‰Π΅Π³ΠΎ. Π‘ ΠΌΠΎΠΌΠ΅Π½Ρ‚Π° своСго основания Π² 1988 Π³ΠΎΠ΄Ρƒ компания EUROSOLAR Ρ€Π°Π±ΠΎΡ‚Π°Π΅Ρ‚ Π½Π°Π΄ Ρ‚Π΅ΠΌ, Ρ‡Ρ‚ΠΎΠ±Ρ‹
ΠΏΠΎΠ»ΠΎΠΆΠΈΡ‚ΡŒ ΠΊΠΎΠ½Π΅Ρ† Π²ΠΎΠΉΠ½Π°ΠΌ Π·Π° ископаСмоС Ρ‚ΠΎΠΏΠ»ΠΈΠ²ΠΎ ΠΏΡƒΡ‚Π΅ΠΌ ΠΌΠ°ΡΡˆΡ‚Π°Π±Π½ΠΎΠ³ΠΎ ΠΏΠ΅Ρ€Π΅Ρ…ΠΎΠ΄Π° Π½Π° 100% возобновляСмыС
источники энСргии. По словам Π“Π΅Ρ€ΠΌΠ°Π½Π° Π¨Π΅Π΅Ρ€Π° ( 1944-2010), основатСля EUROSOLAR: β€ž ВозобновляСмыС
источники энСргии ΡΠΎΠ·Π΄Π°ΡŽΡ‚ ΠΌΠΈΡ€ β€œ. Π’Π΅ΠΊ ископаСмо-ядСрных ΡƒΠ³Ρ€ΠΎΠ· Π΄ΠΎΠ»ΠΆΠ΅Π½ Π·Π°ΠΊΠΎΠ½Ρ‡ΠΈΡ‚ΡŒΡΡ, Π΄ΠΎΠ»ΠΆΠ½Π° Π½Π°Ρ‡Π°Ρ‚ΡŒΡΡ
ΡΠΊΠ·ΠΈΡΡ‚Π΅Π½Ρ†ΠΈΠ°Π»ΡŒΠ½Π°Ρ ориСнтация: https: //www.earthdecade.org. EUROSOLAR ΠΏΡ€ΠΈΠ·Ρ‹Π²Π°Π΅Ρ‚ ΠΊ ΠΏΠ΅Ρ€Π΅ΠΎΡΠΌΡ‹ΡΠ»Π΅Π½ΠΈΡŽ Π²
сторону климатичСской ΠΌΠΈΡ€Π½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚ΠΈΠΈ, которая ΠΏΡ€ΠΈΠ·Π½Π°Π΅Ρ‚ ΠΈ борСтся с ископаСмой Π·Π°Π²ΠΈΡΠΈΠΌΠΎΡΡ‚ΡŒΡŽ ΠΊΠ°ΠΊ
Π²Π΅Π»ΠΈΡ‡Π°ΠΉΡˆΠΈΠΌ ΠΎΠ±Ρ‰ΠΈΠΌ Π²Ρ€Π°Π³ΠΎΠΌ чСловСчСства.https:
//www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom
cy/ Independent of political parties, institutions, companies and interest groups, EUROSOLAR has
been developing and stimulating political and economic action drafts and concepts for the
introduction of renewable energies since 1988. This ranges from market introduction strategies to
proposals for further research and development policy, from tax policy subsidies to arms conversion
with solar energy, from the contribution of solar energy for the Global South to agricultural,
transport and construction policy. EuropΓ€ische Vereinigung fΓΌr Erneuerbare Energien e. V.
articles_df = articles_df[articles_df["lang"] == "en"]

Our exploration revealed a small number of non-English articles: four in German and one labelled Russian that in fact mixes English, Ukrainian, and Russian sections. Since most LLMs and embedding models are primarily trained on English text, removing these articles ensures compatibility with the models chosen for this notebook. For simplicity, we'll support only English queries and responses in this RAG pipeline.

Challenges of Multilingual RAG PipelinesΒΆ

Introducing multilingual capabilities into a RAG pipeline adds a layer of complexity. Key challenges include:

- Language detection: every document and every user query must be reliably identified, and mixed-language documents resist a single label.
- Embedding quality: multilingual embedding models generally trail their English-only counterparts, so cross-lingual retrieval (e.g. an English query against a German article) is less precise.
- Tokenization: token counts per sentence vary widely across languages, which complicates chunk-size choices.
- Generation: the LLM must answer in the user's language even when the retrieved context is written in another.
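The EUROSOLAR open letter above, labelled simply "ru" despite mixing English, Ukrainian, and Russian, illustrates one pitfall: single-label detectors assign exactly one language per document. A cheap, stdlib-only way to flag such mixed-script texts (a sketch, not part of the pipeline):

```python
import unicodedata

def script_shares(text: str) -> dict:
    """Rough share of Latin vs. Cyrillic letters in a text."""
    counts = {"LATIN": 0, "CYRILLIC": 0}
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for script in counts:
                if name.startswith(script):
                    counts[script] += 1
    total = sum(counts.values()) or 1
    return {script: count / total for script, count in counts.items()}

shares = script_shares("EUROSOLAR призывает к переходу on renewable energy")
print(shares)  # both scripts present, so likely a mixed-language document
```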

Characters, Tokens and WordsΒΆ

Let us further analyze the contents of the articles. Before we do so, let us define what we mean by characters, tokens and words:

- Characters: the individual symbols in a text, including letters, digits, punctuation and whitespace.
- Words: whitespace-separated units of running text.
- Tokens: the units a tokenizer produces; depending on the tokenizer these can be words, punctuation marks or sub-word pieces, so token counts usually exceed word counts.
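A tiny illustration of how the three counts diverge, using plain `str.split` as a stand-in for a real tokenizer (the example sentence is made up):

```python
text = "Enphase's IQ8 microinverters don't require resizing."

print(len(text))          # characters, including spaces and punctuation
print(len(text.split()))  # whitespace-separated words
# a linguistic tokenizer such as spaCy's splits further, e.g.
# "don't" into "do" + "n't" and the trailing "." as its own token,
# so the token count typically exceeds the word count
```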

sns.histplot(articles_df["article"].map(len), kde=True)

plt.title("Amount of characters in articles")
plt.xlabel("Amount of characters")
plt.ylabel("Number of articles")
median_char_len = articles_df["article"].map(len).median()
mean_char_len = articles_df["article"].map(len).mean()
plt.axvline(median_char_len, color='r', linestyle='--', label=f"Median character amount: {median_char_len:.2f}")
plt.axvline(mean_char_len, color='g', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
No description has been provided for this image
sns.histplot(articles_df["article"].map(lambda x: len(x.split())), kde=True)

plt.title("Amount of words in articles")
plt.xlabel("Amount of words")
plt.ylabel("Number of articles")
median_word_len = articles_df["article"].map(lambda x: len(x.split())).median()
mean_word_len = articles_df["article"].map(lambda x: len(x.split())).mean()
plt.axvline(median_word_len, color='r', linestyle='--', label=f"Median word amount: {median_word_len:.2f}")
plt.axvline(mean_word_len, color='g', linestyle='--', label=f"Mean word amount: {mean_word_len:.2f}")
plt.legend()
plt.show()
No description has been provided for this image
nlp = English()
tokenizer = nlp.tokenizer

sns.histplot(articles_df["article"].map(lambda x: len(tokenizer(x))), kde=True)

plt.title("Amount of tokens in articles")
plt.xlabel("Amount of tokens")
plt.ylabel("Number of articles")
median_token_len = articles_df["article"].map(lambda x: len(tokenizer(x))).median()
mean_token_len = articles_df["article"].map(lambda x: len(tokenizer(x))).mean()
plt.axvline(median_token_len, color='r', linestyle='--', label=f"Median token amount: {median_token_len:.2f}")
plt.axvline(mean_token_len, color='g', linestyle='--', label=f"Mean token amount: {mean_token_len:.2f}")
plt.legend()
plt.show()
No description has been provided for this image
all_tokens = [token.text for article in articles_df["article"] for token in tokenizer(article)]
# remove non-alphabetic tokens such as punctuation
alpha_tokens = [token for token in all_tokens if token.isalpha()]
alpha_tokens = [token.lower() for token in alpha_tokens]
alpha_token_counts = Counter(alpha_tokens)

sns.barplot(
    x=[count for token, count in alpha_token_counts.most_common(20)],
    y=[token for token, count in alpha_token_counts.most_common(20)],
    hue=[token for token, count in alpha_token_counts.most_common(20)]
)

plt.title("Most common alphabetic tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
No description has been provided for this image

The raw counts are dominated by common function words that do not reflect the subject-specific nature of our document collection. We will remove these stopwords to see the content-bearing vocabulary.

# remove stopwords such as 'the', 'a', 'and'
non_stop_tokens = [token for token in alpha_tokens if not nlp.vocab[token].is_stop]
non_stop_token_counts = Counter(non_stop_tokens)

sns.barplot(
    x=[count for token, count in non_stop_token_counts.most_common(20)],
    y=[token for token, count in non_stop_token_counts.most_common(20)],
    hue=[token for token, count in non_stop_token_counts.most_common(20)]
)

plt.title("Most common non-stopword tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
No description has been provided for this image

As one would expect in a dataset of cleantech news articles, most tokens that are not punctuation or stopwords revolve around energy, climate, and technology. This is a good sign that the dataset is relevant to the topic at hand. The frequent "s" token arises because possessives in the scraped text are already split with a space ("Enphase' s"), leaving a bare "s" behind. With an average of around 700 words per article, we can expect a good amount of information in each article, and an average reading time of roughly 3-4 minutes.
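The reading-time figure is simple arithmetic at a typical silent-reading speed of roughly 200 words per minute (both numbers are approximations):

```python
mean_words = 700            # approximate mean word count from the histogram above
words_per_minute = 200      # typical adult silent-reading speed
reading_minutes = mean_words / words_per_minute
print(f"~{reading_minutes:.1f} minutes per article")
```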

Flesch Reading Ease ScoreΒΆ

The Flesch Reading Ease Score (FRES) is a heuristic that estimates how easy a text is to understand from its average sentence length and the number of syllables per word. Scores typically fall between 0 (very difficult) and 100 (very easy), though the formula is unbounded and can go negative for extremely dense text. Scores below 50 indicate difficult, college-level texts. This metric helps us assess the readability of our articles and whether they are accessible to a broad audience.
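For reference, the underlying formula (as implemented by libraries such as `textstat`) combines average sentence length with average syllables per word; the helper name below is ours, chosen so it doesn't shadow the imported `flesch_reading_ease`:

```python
def flesch_reading_ease_manual(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# one 10-word sentence averaging 1.3 syllables per word scores as "easy"
print(round(flesch_reading_ease_manual(10, 1, 13), 1))
```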

articles_df["readability"] = articles_df["article"].apply(flesch_reading_ease)

sns.histplot(articles_df["readability"], kde=True)

plt.title("Flesch Reading Ease of articles")
plt.xlabel("Flesch Reading Ease")
plt.ylabel("Number of articles")
mean_readability = articles_df["readability"].mean()
plt.axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
plt.legend()
plt.show()
No description has been provided for this image

Next, we analyze how language complexity varies across the different publishing domains.

domains = articles_df["domain"].unique()

# Setup the subplots based on the number of domains
plots_per_row = 3
num_rows = (len(domains) + 2) // plots_per_row 
plot_height = 6 
fig, axes = plt.subplots(num_rows, plots_per_row, figsize=(plot_height * plots_per_row, plot_height * num_rows))
axes = axes.flatten()  # Flatten the axes array for easier iteration

# Plot for each domain
for i, domain in enumerate(domains):
    domain_articles = articles_df[articles_df["domain"] == domain]
    sns.histplot(domain_articles["readability"], kde=True, ax=axes[i], bins=30)
    axes[i].set_title(f'Readability of {domain}')
    axes[i].set_xlabel('Flesch Reading Ease Score')
    axes[i].set_ylabel("Number of articles")
    mean_readability = domain_articles["readability"].mean()
    axes[i].axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")

# remove the empty plots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()
No description has been provided for this image

To gauge the readability of our articles, we calculated the Flesch Reading Ease Score. The average score of around 45 falls in the "difficult" (college-level) band on the Flesch scale. This is unsurprising for technical energy-sector journalism, and it poses no problem for our RAG pipeline, since modern LLMs handle such prose comfortably.

Average scores are broadly similar across the identified domains, with only minor variations, indicating a relatively uniform level of readability across publishers within the dataset.
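The per-domain means plotted above are just grouped averages; on a toy frame with hypothetical numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "domain": ["pv-magazine", "pv-magazine", "rechargenews"],
    "readability": [44.0, 48.0, 41.0],
})
# mean readability per publishing domain
domain_means = df.groupby("domain")["readability"].mean()
print(domain_means)
```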

Finally we will save the cleaned dataset to a new file in the data/silver folder.

silver_folder = data_folder / "silver"
silver_folder.mkdir(parents=True, exist_ok=True)

articles_df.to_csv(silver_folder / "articles.csv", index=False)

Evaluation DataΒΆ

Next we will analyze the provided evaluation data and ensure that it matches the content of the articles.

human_eval_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, 1 to 23
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   question_id     23 non-null     int64 
 1   question        23 non-null     object
 2   relevant_chunk  23 non-null     object
 3   article_url     23 non-null     object
dtypes: int64(1), object(3)
memory usage: 920.0+ bytes
human_eval_df.rename(columns={"relevant_chunk":"relevant_section","article_url": "url"}, inplace=True)
human_eval_df.drop(columns=["question_id"], inplace=True)
human_eval_df.head()
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... https://www.sgvoice.net/strategy/technology/23...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... https://www.sgvoice.net/policy/25396/eu-seeks-...
3 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... https://www.pv-magazine.com/2023/02/02/europea...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... https://www.sgvoice.net/policy/25396/eu-seeks-...
5 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... https://cleantechnica.com/2023/05/08/general-m...
sns.histplot(human_eval_df["question"].map(len), kde=True)
plt.title("Question Character Length Distribution")
plt.xlabel("Character Length")
plt.ylabel("Count")
mean_char_len = human_eval_df["question"].map(len).mean()
plt.axvline(mean_char_len, color='r', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
No description has been provided for this image
missing_articles = human_eval_df[~human_eval_df["url"].isin(articles_df["url"])].copy()
missing_articles
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... https://www.sgvoice.net/strategy/technology/23...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... https://www.sgvoice.net/policy/25396/eu-seeks-...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... https://www.sgvoice.net/policy/25396/eu-seeks-...

Our exploration has identified instances where articles linked to specific questions appear to be missing from the dataset. To determine the root cause, let's investigate whether these articles are genuinely absent or if inconsistencies in URL formatting are creating the illusion of missing data. Normalizing the URLs across the dataset will help us differentiate between these two scenarios.

def normalize_url(url: str) -> str:
    url = url.replace("https://", "")
    url = url.replace("http://", "")
    url = url.replace("www.", "")
    url = url.rstrip("/")
    return url

articles_df["url"] = articles_df["url"].map(normalize_url)
human_eval_df["url"] = human_eval_df["url"].map(normalize_url)

missing_articles = human_eval_df.copy()
missing_articles = missing_articles[~human_eval_df["url"].isin(articles_df["url"])]
missing_articles
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... sgvoice.net/strategy/technology/23971/leclanch...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... sgvoice.net/policy/25396/eu-seeks-competitive-...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... sgvoice.net/policy/25396/eu-seeks-competitive-...

We also know from previous analysis that there are duplicate articles hosted on the "energyvoice" domain, so we will normalize these URLs as well.

missing_articles["url"] = missing_articles["url"].map(lambda x: x.replace("sgvoice.net", "sgvoice.energyvoice.com"))
missing_articles[~missing_articles["url"].isin(articles_df["url"])]
question relevant_section url
example_id
human_eval_df.loc[missing_articles.index, "url"] = missing_articles["url"]
human_eval_df[human_eval_df["url"].isin(articles_df["url"])]
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... sgvoice.energyvoice.com/strategy/technology/23...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... sgvoice.energyvoice.com/policy/25396/eu-seeks-...
3 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... pv-magazine.com/2023/02/02/european-commission...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... sgvoice.energyvoice.com/policy/25396/eu-seeks-...
5 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... cleantechnica.com/2023/05/08/general-motors-se...
6 Did Colgate-Palmolive enter into PPA agreement... Scout Clean Energy, a Colorado-based renewable... solarindustrymag.com/scout-and-colgate-palmoli...
7 What is the status of ZeroAvia's hydrogen fuel... In December, the US startup ZeroAvia announced... cleantechnica.com/2023/01/02/the-wait-for-hydr...
8 What is the "Danger Season"? As spring turns to summer and the days warm up... cleantechnica.com/2023/05/15/what-does-a-norma...
9 Is Mississipi an anti-ESG state? Mississippi is among two dozen or so states in... cleantechnica.com/2023/05/15/mississippi-takes...
10 Can you hang solar panels on garden fences? Scaling down from the farm to the garden level... cleantechnica.com/2023/05/18/solar-panels-for-...
11 Who develops quality control systems for ocean... Scientists from the Chinese Academy of Science... azocleantech.com/news.aspx?newsID=32873
12 Why are milder winters detrimental for grapes ... Since grapes and apples are perennial species,... azocleantech.com/news.aspx?newsID=33040
13 What are the basic recycling steps for solar p... There are some simple recycling steps that can... azocleantech.com/news.aspx?newsID=33143
14 Why does melting ice contribute to global warm... Whereas white ice reflects the sun's rays, a d... azocleantech.com/news.aspx?newsID=33149
15 Does the Swedish government plan bans on new p... The Swedish government has proposed a ban on n... azocleantech.com/news.aspx?newsID=33174
16 Where do the turbines used in Icelandic geothe... Minister Nishimura mentioned that most geother... thinkgeoenergy.com/japan-and-iceland-agree-on-...
17 Who is the target user for Leapfrog Energy? O’Brien added, β€œSubsurface specialists need fl... thinkgeoenergy.com/seequent-expands-subsurface...
18 What is Agrivoltaics? Agrivoltaics, the integration of food producti... pv-magazine.com/2023/03/31/new-software-modeli...
19 What is Agrivoltaics? Agrivoltaics refers to the conduct of agricult... cleantechnica.com/2022/12/18/agrivoltaics-goes...
20 Why is cannabis cultivation moving indoors? Cannabis cultivation can take place outdoors, ... pv-magazine.com/2023/04/08/high-time-for-solar...
21 What are the obstacles for cannabis producers ... β€œThere are a lot of prevailing headwinds for c... pv-magazine.com/2023/04/08/high-time-for-solar...
22 In 2021, what were the top 3 states in the US ... In 2021, Florida surpassed North Carolina to b... cleantechnica.com/2023/04/10/solar-power-in-fl...
23 Which has the higher absorption coefficient fo... We chose amorphous germanium instead of amorph... pv-magazine.com/2021/01/15/germanium-based-sol...

In the end we are able to find all the articles that are linked to the evaluation data and have therefore successfully completed our exploratory data analysis and preprocessing.

SubsamplingΒΆ

For faster processing and to reduce the cost of running the notebook we will subsample the dataset to 1000 articles. This will allow us to run the notebook in a reasonable amount of time and still provide meaningful results. Because the distribution of articles across publishers is skewed we will use stratified sampling to ensure that we have a representative sample. We also need to keep in mind that the evaluation data are linked to specific articles so we need to make sure that these are included in the subsample.

eval_articles_df = articles_df[articles_df["url"].isin(human_eval_df["url"])]
eval_articles_df.head()
title content domain url article lang readability
6780 Leclanché’ s new disruptive battery boosts ene... ['Energy storage company LeclanchΓ© ( SW.LECN) ... energyvoice sgvoice.energyvoice.com/strategy/technology/23... Energy storage company LeclanchΓ© ( SW.LECN) ha... en 43.22
6805 EU seeks competitive boost with Green Deal Ind... ['The EU has presented its β€˜ Green Deal Indust... energyvoice sgvoice.energyvoice.com/policy/25396/eu-seeks-... The EU has presented its β€˜ Green Deal Industri... en 34.70
16367 Agrivoltaics Goes Nuclear On California Prairie ['A decommissioned nuclear power plant from th... cleantechnica cleantechnica.com/2022/12/18/agrivoltaics-goes... A decommissioned nuclear power plant from the ... en 42.00
16402 The Wait For Hydrogen Fuel Cell Electric Aircr... ['The US firm ZeroAvia is one step closer to b... cleantechnica cleantechnica.com/2023/01/02/the-wait-for-hydr... The US firm ZeroAvia is one step closer to bri... en 50.46
16725 Solar Power In Florida ['Many renewable energy endeavors in Florida a... cleantechnica cleantechnica.com/2023/04/10/solar-power-in-fl... Many renewable energy endeavors in Florida are... en 44.75
print(eval_articles_df["url"].unique().shape)
print(human_eval_df["url"].unique().shape)
(21,)
(21,)
def do_stratification(
        df: pd.DataFrame,
        column: str,
        sample_size: int,
        seed: int = 42
) -> pd.DataFrame:
    res_df = df.copy()
    indx = (
        df.groupby(column, group_keys=False)[column]
        .apply(lambda x: x.sample(n=int(sample_size / len(df) * len(x)), random_state=seed))
        .index.to_list()
    )
    return res_df.loc[indx]
sample_df = do_stratification(articles_df, "domain", 1000, 69)
# if the articles are already in the subsample from the evaluation set, then we remove them, so we just want unique urls
sample_df = sample_df[~sample_df["url"].isin(eval_articles_df["url"])]
sample_df = pd.concat([sample_df, eval_articles_df])
sample_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1011 entries, 38325 to 81779
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        1011 non-null   object 
 1   content      1011 non-null   object 
 2   domain       1011 non-null   object 
 3   url          1011 non-null   object 
 4   article      1011 non-null   object 
 5   lang         1011 non-null   object 
 6   readability  1011 non-null   float64
dtypes: float64(1), object(6)
memory usage: 63.2+ KB
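The allocation rule inside `do_stratification` (each group contributes a number of items proportional to its share of the data) can be verified on toy data. `stratified_indices` below is a hypothetical stdlib-only stand-in for the notebook's pandas implementation:

```python
import random
from collections import Counter, defaultdict

def stratified_indices(labels, sample_size, seed=42):
    """Return indices sampled proportionally to each label's frequency."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    picked = []
    n = len(labels)
    for label, idxs in groups.items():
        # same allocation as the notebook: sample_size / len(df) * group size
        k = int(sample_size / n * len(idxs))
        picked.extend(rng.sample(idxs, k))
    return picked

# a skewed "domain" distribution: 60% / 30% / 10%
labels = ["a"] * 600 + ["b"] * 300 + ["c"] * 100
idx = stratified_indices(labels, 100)
counts = Counter(labels[i] for i in idx)
assert counts == Counter(a=60, b=30, c=10)  # proportions preserved exactly
```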

To make sure that the distributional characteristics have not been changed by subsampling, we visualize and compare both data sets in relative terms.

original_domain_counts = articles_df["domain"].value_counts().to_frame()
original_domain_counts = original_domain_counts / original_domain_counts.sum() * 100
domain_counts_df = original_domain_counts.copy()
domain_counts_df["type"] = "Original"


sample_domain_counts = sample_df["domain"].value_counts().to_frame()
sample_domain_counts = sample_domain_counts / sample_domain_counts.sum() * 100
sample_domain_counts["type"] = "Sample"

domain_counts_df = pd.concat([domain_counts_df, sample_domain_counts])
sns.barplot(
    x=domain_counts_df.index,
    y=domain_counts_df["count"],
    hue=domain_counts_df["type"]
)
plt.title("Domain Distribution")
plt.xlabel("Domain")
plt.ylabel("Percentage")
plt.xticks(rotation=90)
plt.show()

Now all is prepared to start developing our RAG!

ChunkingΒΆ

Chunking is a crucial step in the RAG pipeline. It involves breaking down the articles into smaller, more manageable pieces.

chunking

There are mainly two reasons for this: embedding models can only encode a limited amount of text at once, and smaller, focused chunks make retrieval more precise while keeping the context passed to the LLM compact.

Let's start by getting a feeling for common chunk sizes in terms of the number of characters.

def get_lorem_text(num_chars: int) -> str:
    expected_avg_word_len = 3 # on the lower side to be safe
    text = lorem.words(num_chars // expected_avg_word_len)
    return text[:num_chars]
print(wrap_text(get_lorem_text(256)))
repellendus nobis veritatis voluptatem fugit vero odit tenetur ipsam culpa ab officia quas rerum
nihil nemo veniam iure eveniet nesciunt quidem error impedit officiis neque enim consequatur fugiat
illum fuga voluptatibus magni dolor tempore maxime nostrum 
print(wrap_text(get_lorem_text(512)))
debitis tenetur ipsa impedit quod facilis ipsam deserunt quia iste eum quasi alias provident
ducimus numquam aliquid maxime similique veritatis iure tempora doloribus facere inventore fuga
quos omnis necessitatibus soluta expedita maiores dolores incidunt nihil rem laboriosam sunt vel
totam itaque voluptates exercitationem sequi dolorum molestiae sapiente architecto ad ullam commodi
iusto corporis eligendi velit perferendis laborum dicta odit dolor cumque accusamus ea distinctio
nisi consectetur et quidem p
print(wrap_text(get_lorem_text(1024)))
earum maiores quibusdam nam reprehenderit eum voluptatibus mollitia nisi magni quas autem optio
molestias natus expedita totam eius quia atque quod sit ad iste qui ullam corrupti in ipsum
accusantium hic eos illo rerum voluptatem fugiat iure assumenda distinctio nobis consequuntur
itaque ea possimus molestiae amet fuga animi dolores temporibus dolore tempore explicabo corporis
nesciunt consectetur sequi quisquam illum minima odit omnis reiciendis repellat repudiandae
blanditiis minus non necessitatibus sint obcaecati aliquam ex perspiciatis voluptate culpa unde
provident doloribus vel sed suscipit repellendus officiis quaerat libero laborum et quae architecto
ut exercitationem soluta vero aut enim laudantium voluptatum accusamus nulla praesentium deserunt
id asperiores ipsam similique facere aliquid tempora eligendi ratione sapiente neque cumque dolorem
rem delectus dolorum impedit incidunt adipisci esse eveniet ipsa modi perferendis commodi dolor
officia magnam doloremque pariatur velit facilis inventore nos
print(wrap_text(get_lorem_text(2048)))
repellat laborum voluptates sint facilis eaque fuga corporis unde labore quia illo id rem at maxime
iste quae quos aliquid provident atque consectetur doloremque eligendi non dolore quod pariatur ab
rerum quas molestias corrupti sequi blanditiis deserunt qui mollitia temporibus modi sunt harum
consequatur asperiores necessitatibus reprehenderit perspiciatis dicta eveniet ad voluptatum totam
nesciunt amet nihil voluptate alias facere ut ducimus excepturi aperiam nobis beatae aliquam omnis
laudantium cupiditate soluta cum quisquam iusto accusantium exercitationem autem illum neque optio
nisi sit fugiat iure recusandae minima earum natus enim aut debitis odit doloribus voluptas magni
tempore veritatis voluptatibus commodi veniam molestiae libero et magnam vero eos esse nam fugit
voluptatem ipsa porro officia inventore quidem dolores tenetur dolor architecto quo dolorem placeat
ipsam minus sapiente ratione ipsum dolorum quam ex quasi ullam nostrum hic delectus in consequuntur
numquam laboriosam reiciendis culpa est explicabo ea possimus a saepe nulla tempora maiores
dignissimos obcaecati perferendis eius incidunt quibusdam repudiandae suscipit nemo impedit
adipisci similique distinctio sed animi officiis velit quaerat odio accusamus cumque assumenda
vitae deleniti expedita praesentium vel aspernatur eum error itaque quis repellendus culpa earum
eveniet libero cupiditate ea dolorem officia mollitia vitae consequatur veniam repellat delectus
illum sapiente ut ex eaque neque inventore consectetur natus officiis quibusdam modi fuga sunt id
dolore animi similique reprehenderit nulla magni vel iusto odit dolor architecto nostrum tempore
sit perferendis laboriosam corrupti tempora alias dolores iste dolorum cumque enim facere qui
tenetur quasi quia autem iure minus obcaecati distinctio soluta et assumenda nam provident possimus
blanditiis saepe rerum adipisci debitis accusantium minima nesciunt quam deserunt eius magnam omnis
error doloremque quos voluptas consequuntur nobis laudantium amet voluptates quo ipsum aut nemo rat

Creating the ChunksΒΆ

In this notebook we will be using two different chunking strategies: recursive character splitting and semantic chunking.

To see how different texts get chunked with different strategies and chunk sizes check out the Chunking Visualizer.

def get_recursive_splitter(chunk_size: int, chunk_overlap: int) -> TextSplitter:
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
        # without this flag the sentence-boundary lookbehind is treated as a literal string
        is_separator_regex=True,
        length_function=len,
    )
# the recursive splitter prefers to split on newlines; are there any? No, so it will fall back to sentence boundaries.
sample_df["article"].map(lambda x: x.count("\n")).sum()
0

Let us set the device for efficient use of available resources.

# if we can make use of any device that is better than the CPU, we will use it
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"

model_kwargs = {'device': device, "trust_remote_code": True}
model_kwargs
{'device': 'cuda', 'trust_remote_code': True}

We select three embedding models from HuggingFace to represent our text fragments in numerical form in a vector space.

embedding_models = {
    "mini": HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs=model_kwargs),
    "bge-m3": HuggingFaceEmbeddings(model_name="BAAI/bge-m3", model_kwargs=model_kwargs),
    "gte": HuggingFaceEmbeddings(model_name="Alibaba-NLP/gte-base-en-v1.5", model_kwargs=model_kwargs),
}

We also define the chunking strategies to be used. Recursive splitting is characterized by the length of chunks and the overlap between adjacent chunks. For semantic chunking, sentences embedded as dense vectors are merged as long as the cosine distance between consecutive sentences does not exceed a percentile-based threshold.
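To make the recursive splitter's two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap; the real splitter additionally prefers to cut at separators such as paragraph breaks and sentence ends:

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list:
    """Naive fixed-size chunking: each window starts (chunk_size - overlap)
    characters after the previous one, so adjacent chunks share `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(chr(97 + i % 26) for i in range(1000))
chunks = chunk_with_overlap(text, 256, 64)
assert len(chunks) == 6
assert all(len(c) <= 256 for c in chunks)
# consecutive chunks share exactly `overlap` characters
assert chunks[0][-64:] == chunks[1][:64]
```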

recursive_256_splitter = get_recursive_splitter(256, 64)
recursive_1024_splitter = get_recursive_splitter(1024, 128)
semantic_splitter = SemanticChunker(
    embedding_models["gte"], breakpoint_threshold_type="percentile"
)
splitters = {
    "recursive_256": recursive_256_splitter,
    "recursive_1024": recursive_1024_splitter,
    "semantic": semantic_splitter
}
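The percentile-threshold idea behind the semantic chunker can be sketched with toy 2-d "sentence embeddings". `semantic_breakpoints` below is a simplified stand-in for what `SemanticChunker` does internally, not its actual implementation:

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_rank_percentile(values, q):
    # nearest-rank percentile is enough for this sketch
    ordered = sorted(values)
    k = max(0, math.ceil(q / 100 * len(ordered)) - 1)
    return ordered[k]

def semantic_breakpoints(sentence_embeddings, q=50):
    """A new chunk starts wherever the distance between consecutive
    sentence embeddings exceeds the q-th percentile of all distances."""
    dists = [cosine_distance(a, b)
             for a, b in zip(sentence_embeddings, sentence_embeddings[1:])]
    threshold = nearest_rank_percentile(dists, q)
    return [i + 1 for i, d in enumerate(dists) if d > threshold]

# two "topics": sentences 0-1 are similar, 2-3 are similar, 1 -> 2 jumps
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assert semantic_breakpoints(embeddings, q=50) == [2]
```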
def chunk_documents(df: pd.DataFrame, text_splitter: TextSplitter):
    chunks = []
    chunk_id = 0
    for _, row in tqdm(df.iterrows(), total=len(df)):
        article_content = row['article']
        title = row['title']
        # we add the title to the content as it might be relevant to the question
        full_text = title + ": " + article_content
        char_chunks = text_splitter.split_text(full_text)
        for chunk in char_chunks:
            chunk_id += 1
            # add metadata to the chunk for potential later use
            metadata = {
                'title': row['title'],
                'url': row['url'],
                'domain': row['domain'],
                'id': chunk_id,
            }
            chunks.append(Document(
                page_content=chunk,
                metadata=metadata,
            ))
    return chunks
chunks_folder = silver_folder / "chunks"
if not chunks_folder.exists():
    chunks_folder.mkdir()

The following function loads existing chunks, prepared for this tutorial to speed up the preparation process; if none exist, it creates and saves them.

def get_or_create_chunks(df: pd.DataFrame, text_splitter: TextSplitter, splitter_name: str) -> List[Document]:
    chunks_file = chunks_folder / f"{splitter_name}_chunks.json"
    if chunks_file.exists():
        with open(chunks_file, "r") as file:
            chunks = [Document(**chunk) for chunk in json.load(file)]
        print(f"Loaded {len(chunks)} chunks from {chunks_file}")
    else:
        chunks = chunk_documents(df, text_splitter)
        with open(chunks_file, "w") as file:
            json.dump([doc.dict() for doc in chunks], file, indent=4)
        print(f"Saved {len(chunks)} chunks to {chunks_file}")
    return chunks
chunks = {}
for splitter_name, splitter in splitters.items():
    chunks[splitter_name] = get_or_create_chunks(sample_df, splitter, splitter_name)
Loaded 25399 chunks from data\silver\chunks\recursive_256_chunks.json
Loaded 5754 chunks from data\silver\chunks\recursive_1024_chunks.json
Loaded 3144 chunks from data\silver\chunks\semantic_chunks.json

Now that we have created and saved the chunks we can analyze them. We can already see above that the semantic chunks are generally larger than the recursive chunks.

Analyzing the ChunksΒΆ

Let's start by looking at the first chunk of the first article to get a feeling for how the chunks differ between strategies; then we will look at the distribution of chunk sizes and the number of chunks per article.

for splitter_name, splitter_chunks in chunks.items():
    print(f"{splitter_name} chunks:")
    print(wrap_text(splitter_chunks[0].page_content, char_per_line=150))
    print()
recursive_256 chunks:
Satellite Vu: Quotes, Address, Contact:  Satellite Vu will monitor the temperature of any structure on the planet in near real time. Infrared is the
next generation Earth observation sensor and Satellite Vu will be using this data to determine valuable

recursive_1024 chunks:
Satellite Vu: Quotes, Address, Contact:  Satellite Vu will monitor the temperature of any structure on the planet in near real time. Infrared is the
next generation Earth observation sensor and Satellite Vu will be using this data to determine valuable insights into economic activity, energy
efficiency and carbon footprint. This will enable better business decisions. Bad decisions are being made all over the world. These decisions are
having a global impact! Satellite Vu will change these decisions for good. The Sensi+β„’ is a laser-based analyzer used for monitoring natural gas
quality. The Cypher ES AFM from Oxford Instruments Asylum Research can be utilized for exceptional environmental control. The Vocus CI-TOF from
TOFWERK provides real-time chemical ionization measurements. In this interview, AZoCleantech speaks with Tebogo Maleka, National Project Coordinator
at the United Nations Industrial Development Organization ( UNIDO), about her role within the organization and the initiative that aims to support

semantic chunks:
Satellite Vu: Quotes, Address, Contact:  Satellite Vu will monitor the temperature of any structure on the planet in near real time. Infrared is the
next generation Earth observation sensor and Satellite Vu will be using this data to determine valuable insights into economic activity, energy
efficiency and carbon footprint. This will enable better business decisions.

def plot_chunk_lengths(chunks: List[Document], title: str):
    sns.histplot([len(chunk.page_content) for chunk in chunks], kde=True)
    plt.title(title)
    plt.xlabel("Chunk length (characters)")
    plt.ylabel("Number of chunks")
    median_chunk_len = np.median([len(chunk.page_content) for chunk in chunks])
    mean_chunk_len = np.mean([len(chunk.page_content) for chunk in chunks])
    plt.axvline(median_chunk_len, color='r', linestyle='--', label=f"Median chunk length: {median_chunk_len:.2f}")
    plt.axvline(mean_chunk_len, color='g', linestyle='--', label=f"Mean chunk length: {mean_chunk_len:.2f}")
    plt.legend()
    plt.show()
plot_chunk_lengths(chunks["recursive_256"], "Chunk lengths for recursive 256 splitter")
plot_chunk_lengths(chunks["recursive_1024"], "Chunk lengths for recursive 1024 splitter")
plot_chunk_lengths(chunks["semantic"], "Chunk lengths for semantic splitter")
chunks_per_article = {splitter_name: Counter([chunk.metadata["title"] for chunk in chunks]) for splitter_name, chunks in chunks.items()}
counts = {splitter_name: [count for title, count in chunk_counts.items()] for splitter_name, chunk_counts in chunks_per_article.items()}

sns.histplot(counts, kde=True)
plt.title("Number of chunks per article")
plt.xlabel("Number of chunks")
plt.ylabel("Number of articles")
plt.legend(chunks_per_article.keys())
plt.show()

From our analysis of our created chunks we can see that the recursive chunks are all around the same size, close to the defined maximum. On the other hand, the semantic chunks vary in size. This is because the semantic chunking strategy is based on the semantic boundaries of the article.

We can also see that, despite the semantic chunks being larger, the distribution of the number of chunks per article is much wider for the recursive chunks. Because recursive chunks have a fixed size, their count scales directly with article length, while semantic chunking absorbs length differences into chunk size, producing many smaller chunks and a few larger ones.

Generating EmbeddingsΒΆ

Now that we have clean chunks, the next step involves generating embeddings for our article chunks. These embeddings will serve as a crucial component for efficient retrieval within the RAG pipeline. For our vector store we'll utilize ChromaDB, a powerful tool for indexing and searching high-dimensional data. To integrate our chosen embedding models with ChromaDB, we'll define a custom wrapper class. This wrapper class will act as an intermediary, ensuring seamless communication between the models and the ChromaDB indexing system.

class CustomChromadbEmbeddingFunction(EmbeddingFunction):

    def __init__(self, model) -> None:
        super().__init__()
        self.model = model

    def _embed(self, texts):
        return [self.model.embed_query(text) for text in texts]

    def embed_query(self, query):
        return self._embed([query])

    def __call__(self, input: Documents) -> Embeddings:
        embeddings = self._embed(input)
        return embeddings
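The wrapper simply fans a batch of texts out to the model's `embed_query`. With a stand-in model (so no chromadb or HuggingFace download is needed), the call path looks like this; `DummyModel` and its character-code embedding are purely illustrative:

```python
class DummyModel:
    """Stand-in for a HuggingFaceEmbeddings model: embeds a text as
    [mean character code, text length]."""
    def embed_query(self, text):
        return [sum(map(ord, text)) / len(text), float(len(text))]

class WrapperSketch:
    """Mimics the embedding-function wrapper: batch in, list of vectors out."""
    def __init__(self, model) -> None:
        self.model = model

    def __call__(self, input):
        return [self.model.embed_query(text) for text in input]

fn = WrapperSketch(DummyModel())
out = fn(["ab", "abcd"])
assert out == [[97.5, 2.0], [98.5, 4.0]]  # one vector per input text
```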

We wrap each of the three embedding models for use with ChromaDB.

chroma_embedding_functions = {
    "mini": CustomChromadbEmbeddingFunction(embedding_models["mini"]),
    "bge-m3": CustomChromadbEmbeddingFunction(embedding_models["bge-m3"]),
    "gte": CustomChromadbEmbeddingFunction(embedding_models["gte"]),
}
for name, embedding_function in chroma_embedding_functions.items():
    sample = embedding_function(["Hello, world!"])[0][:5]
    print(f"{name} embedding sample: {sample}")
mini embedding sample: [0.03492265194654465, 0.01883007027208805, -0.017854733392596245, 0.00013882208440918475, 0.07407363504171371]
bge-m3 embedding sample: [-0.016155648976564407, 0.02699340134859085, -0.042583219707012177, 0.013542206957936287, -0.01935463584959507]
gte embedding sample: [0.03789425268769264, 0.346923828125, -0.20471259951591492, -0.2123868763446808, -0.49100878834724426]

Generating embeddings can be a computationally intensive process. To optimize efficiency and avoid redundant computations, we'll leverage checkpointing. This technique involves storing the generated embeddings along with their corresponding article chunks. We'll define a simple class to encapsulate this data, facilitating efficient retrieval and reducing the need for recalculating embeddings unless absolutely necessary.

embeddings_folder = silver_folder / "embeddings"
if not embeddings_folder.exists():
    embeddings_folder.mkdir()
class DocumentEmbedding():
    def __init__(self, document: Document, text_embedding: List[float]) -> None:
        self.document = document
        self.text_embedding = text_embedding
    
    def to_dict(self) -> Dict:
        return {
            "document": self.document.dict(),
            "text_embedding": self.text_embedding
        }
    
    @classmethod
    def from_dict(cls, d: Dict) -> "DocumentEmbedding":
        return cls(
            document=Document(**d["document"]),
            text_embedding=d["text_embedding"]
        )


def get_or_create_embeddings(
        embedding_function: EmbeddingFunction,
        chunks: List[Document],
        embedding_name: str,
) -> List[DocumentEmbedding]:
    embeddings_file = embeddings_folder / f"{embedding_name}_embeddings.json"
    if embeddings_file.exists():
        with open(embeddings_file, "r") as file:
            embeddings = [DocumentEmbedding.from_dict(embedding) for embedding in json.load(file)]
        print(f"Loaded {len(embeddings)} embeddings from {embeddings_file}")
    else:
        embeddings = []
        for chunk in tqdm(chunks):
            text_embedding = embedding_function([chunk.page_content])[0]
            embedding = DocumentEmbedding(
                document=chunk,
                text_embedding=text_embedding
            )
            embeddings.append(embedding)
        with open(embeddings_file, "w") as file:
            json.dump([embedding.to_dict() for embedding in embeddings], file, indent=4)
        print(f"Saved {len(embeddings)} embeddings to {embeddings_file}")
    return embeddings
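Stripped of the embedding specifics, the checkpointing pattern used by `get_or_create_embeddings` is: compute once, persist as JSON, and serve subsequent calls from disk. A toy version with a temporary directory and a hypothetical `expensive` function:

```python
import json
import tempfile
from pathlib import Path

calls = {"n": 0}

def expensive(x):
    """Stand-in for an expensive computation such as embedding a chunk."""
    calls["n"] += 1
    return [x, x * 2]

def get_or_create(folder: Path, key: int):
    checkpoint = folder / f"{key}.json"
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())  # reuse the checkpoint
    result = expensive(key)
    checkpoint.write_text(json.dumps(result))      # save for next time
    return result

with tempfile.TemporaryDirectory() as d:
    a = get_or_create(Path(d), 3)
    b = get_or_create(Path(d), 3)  # second call is served from disk
assert a == b == [3, 6]
assert calls["n"] == 1  # the expensive computation ran only once
```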
embeddings = {}
for embedding_name, embedding_function in chroma_embedding_functions.items():
    for splitter_name, splitter_chunks in chunks.items():
        embeddings[f"{embedding_name}_{splitter_name}"] = get_or_create_embeddings(
            embedding_function, splitter_chunks, f"{embedding_name}_{splitter_name}"
        )
Loaded 25399 embeddings from data\silver\embeddings\mini_recursive_256_embeddings.json
Loaded 5754 embeddings from data\silver\embeddings\mini_recursive_1024_embeddings.json
Loaded 3144 embeddings from data\silver\embeddings\mini_semantic_embeddings.json
Loaded 25399 embeddings from data\silver\embeddings\bge-m3_recursive_256_embeddings.json
Loaded 5754 embeddings from data\silver\embeddings\bge-m3_recursive_1024_embeddings.json
Loaded 3144 embeddings from data\silver\embeddings\bge-m3_semantic_embeddings.json
Loaded 25399 embeddings from data\silver\embeddings\gte_recursive_256_embeddings.json
Loaded 5754 embeddings from data\silver\embeddings\gte_recursive_1024_embeddings.json
Loaded 3144 embeddings from data\silver\embeddings\gte_semantic_embeddings.json

The number of embeddings corresponds to the number of chunks produced by each chunking strategy, not to the embedding dimensions. Thus a smaller chunk size (e.g. 256) yields more chunks than a larger one (1024), and semantic chunking yields the fewest.

Storing the Embeddings in ChromaDBΒΆ

As mentioned above, for our semantic-search retrieval we will store the embeddings in ChromaDB. Among other things, ChromaDB supports approximate nearest neighbor (ANN) search based on the Hierarchical Navigable Small World (HNSW) algorithm, which is known for its efficiency in high-dimensional spaces.

Just like with an ordinary SQL database, there is a storage backend, here an embedded SQLite database, that we connect to with a client. For each set of embeddings we create a separate vector space, which can be thought of as an index; ChromaDB calls these separate vector spaces "collections". These collections will then be used to search for the chunks most relevant to a user query.

semantic search
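Conceptually, a query against such a collection is a nearest-neighbour search under cosine distance (`1 - cosine similarity`); HNSW merely approximates this brute-force ranking efficiently. A minimal sketch with toy 2-d vectors standing in for chunk embeddings:

```python
import math

def cosine_dist(u, v):
    """Cosine distance, the metric selected via the "hnsw:space" setting."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# toy "embeddings" standing in for stored chunk vectors
store = {"doc_a": [1.0, 0.0], "doc_b": [0.6, 0.8], "doc_c": [0.0, 1.0]}
query = [0.9, 0.1]

# brute-force ranking: what HNSW approximates
ranked = sorted(store, key=lambda name: cosine_dist(query, store[name]))
assert ranked == ["doc_a", "doc_b", "doc_c"]
```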

gold_folder = data_folder / "gold"
if not gold_folder.exists():
    gold_folder.mkdir()
chromadb_folder = gold_folder / "chromadb"
if not chromadb_folder.exists():
    chromadb_folder.mkdir()

chroma_client = chromadb.PersistentClient(path=chromadb_folder.as_posix())

Again we can make use of preprocessed data as before to speed up the preparatory steps.

def get_or_create_collection(
        name: str,
        embedding_function: EmbeddingFunction,
        embeddings: List[DocumentEmbedding],
        batch_size: int = 128
) -> Collection:

    collection = chroma_client.get_or_create_collection(
        name=name,
        # configure to use cosine distance not default L2
        metadata={"hnsw:space": "cosine"},
        embedding_function=embedding_function
    )

    if collection.count() == 0:
        for i in tqdm(range(0, len(embeddings), batch_size)):
            batch = embeddings[i:i+batch_size]
            collection.add(
                documents=[embedding.document.page_content for embedding in batch],
                embeddings=[embedding.text_embedding for embedding in batch],
                ids=[str(embedding.document.metadata["id"]) for embedding in batch],
                metadatas=[embedding.document.metadata for embedding in batch]
            )

    return collection
collections = {}
for collection_name, current_embeddings in embeddings.items():
    collection = get_or_create_collection(
        collection_name,
        chroma_embedding_functions[collection_name.split("_")[0]],
        current_embeddings
    )
    collections[collection_name] = collection
    print(f"Collection {collection_name} has {collection.count()} documents")
Collection mini_recursive_256 has 25399 documents
Collection mini_recursive_1024 has 5754 documents
Collection mini_semantic has 3144 documents
Collection bge-m3_recursive_256 has 25399 documents
Collection bge-m3_recursive_1024 has 5754 documents
Collection bge-m3_semantic has 3144 documents
Collection gte_recursive_256 has 25399 documents
Collection gte_recursive_1024 has 5754 documents
Collection gte_semantic has 3144 documents

The above printout shows the three embedding models applied to the three chunking strategies.

Once we have stored all the embeddings in ChromaDB, we can test the retrieval process by querying one of our collections. Try a few different queries and check whether the most similar chunks returned make sense.

selected_collection = collections["gte_recursive_1024"]
results = selected_collection.query(
    query_texts=["Climate Change"],
    n_results=3,
)
for doc in results["documents"][0]:
    print(wrap_text(doc))
    print()
Climate Change Archives - Page 5 of 63: Southern countries are pushing hard to make transparent the
wealth and climate consequences of burning fossil fuels. Bill McKibben says it's clear how
impeachably... While I watched the chilled host on the Macy’ s Day Parade television broadcast talk
about Tofurky as a vegan Thanksgiving substitute, I can’ t say... A turkey is a symbol of US
Thanksgiving dinner traditions. But how do you make flexitarians -- guests who prefer vegetarian or
vegan eating... For the first time ever, formal discussions took place at the annual climate
convention about food security. The consensus is that, in order to... The new Chris Hemsworth
project `` Limitless '' is the perfect antidote to climate doomerism ( with bonus energy storage
angle, of course). Food security threatens many regions around the world. Puerto Rico's decades of
dependence on outside food imports has impacted the health and resilience of... Engineers working
on hydrogen, evtols, UAM, vertiports, hypersonic passenger

scenario used in the study is unlikely because of global efforts to limit greenhouse gas emissions,
the findings reveal a previously unknown tipping point that if activated would release an important
brake on global warming, the authors said. `` We need to think about these worst-case scenarios to
understand how our CO2 emissions might affect the oceans not just this century, but next century
and the following century, '' said Megumi Chikamoto, who led the research as a research fellow at
the University of Texas Institute for Geophysics. The study was published in the journal
Geophysical Research Letters. Today, the oceans soak up about a third of the CO2 emissions
generated by humans. Climate simulations had previously shown that the oceans slow their absorption
of CO2 over time, but none had considered alkalinity as explanation. To reach their conclusion, the
researchers recalculated pieces of a 450-year simulation until they hit on alkalinity as a key
cause of the slowing. According to the findings, the

Potential Climatic Impact of Nord Stream Methane Leaks:  Nord Stream 1 and 2, two subsea pipelines
that transport natural gas from Russia to Germany, were both intentionally destroyed on September
26th, 2022. Enormous amounts of gases, mainly methane, were discharged into the ocean and
eventually into the atmosphere. Methane escaping from sabotaged pipelines in the Baltic Sea (
September 27th, 2022). Image Credit: Danish Armed Forces Methane is the second most prevalent
anthropogenic greenhouse gas after CO2, although its greenhouse effect is substantially stronger.
As a result, whether this catastrophe may have detrimental climatic consequences is a major issue
around the world. This problem was discussed in a news article published in Nature, but no
quantitative implications were reached. Recently, scientists from the Chinese Academy of Sciences’
Institute of Atmospheric Physics approximated the potential climatic effect of leaked methane using
the energy-conservation framework of the Intergovernmental

Analyzing the Embedding SpaceΒΆ

To gain a better understanding of how the retrieval process works we will analyze the embedding space. We will start by projecting the embeddings into a 2D space using UMAP. UMAP is a dimensionality reduction technique that is particularly well-suited for visualizing high-dimensional data in a lower-dimensional space. Its most notable advantages over other dimensionality reduction techniques are increased speed and better preservation of the data's global structure. We will then use the UMAP embeddings to create a scatter plot of the chunks.

def get_vectors_from_collection(collection: Collection):
    stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
    return np.array(stored_chunks["embeddings"])

def get_vectors_by_domain(collection: Collection, domain: str):
    stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
    metadatas = stored_chunks["metadatas"]
    indices = [str(metadata["id"]) for metadata in metadatas if metadata["domain"] == domain]
    return collection.get(include=["embeddings"], ids=indices)["embeddings"]

def fit_umap(vectors: np.ndarray):
    return umap.UMAP().fit(vectors)

def project_embeddings(embeddings, umap_transform):
    return umap_transform.transform(embeddings)
vectors = get_vectors_from_collection(selected_collection)
print(f"Original shape: {vectors.shape}")
umap_transform = fit_umap(vectors)
vectors_projections = project_embeddings(vectors, umap_transform)
print(f"Projected shape: {vectors_projections.shape}")
Original shape: (5754, 768)
Projected shape: (5754, 2)

The dimensions above show how the chunked embeddings with 768 dimensions are reduced to two dimensions for visualization purposes.

You can zoom into the plot by clicking and dragging a box around the area of interest, and reset the view by double-clicking on the plot.

fig = px.scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1])
fig.show()

Next we will color the embeddings by the domain of the article to see if there are any patterns or clusters in the embedding space based on the domain.

fig = go.Figure()
for domain in sample_df["domain"].unique():
    domain_vectors = get_vectors_by_domain(selected_collection, domain)
    domain_projections = project_embeddings(domain_vectors, umap_transform)
    fig.add_trace(go.Scatter(x=domain_projections[:, 0], y=domain_projections[:, 1], mode='markers', marker=dict(size=4), name=domain))

fig.show()

We can also visualize the retrieval process by plotting the query and its most similar chunks in the embedding space. This gives a clearer picture of where the retrieved chunks sit relative to the query.

Note that UMAP optimizes its own distance metric, which differs from the approximate-nearest-neighbor search used for retrieval. Also keep in mind that the embeddings live in a high-dimensional space and we are only visualizing a 2D projection of them, so the distances between points may be distorted. Try some different queries and see how the most similar chunks are found.
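This distortion can be demonstrated with plain NumPy on synthetic data. Below, a simple PCA projection (via SVD) stands in for UMAP, and the random arrays mimic chunk and query embeddings; everything here is an illustrative sketch, not part of the pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))   # stand-in for chunk embeddings
q = rng.normal(size=768)          # stand-in for a query embedding

# Cosine distance in the original 768-dimensional space
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
qn = q / np.linalg.norm(q)
cos_dist = 1 - Xn @ qn

# A naive 2D projection (PCA via SVD) as a stand-in for UMAP
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T
q_proj = (q - X.mean(axis=0)) @ Vt[:2].T
eu_dist = np.linalg.norm(proj - q_proj, axis=1)

# The nearest neighbour in the original space and in the projection
# frequently disagree -- exactly the distortion to keep in mind.
print(int(np.argmin(cos_dist)), int(np.argmin(eu_dist)))
```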

def plot_retrieval_results(
        query: str,
        selected_collection: Collection,
        n_results: int = 5
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)

    nearest_neighbors = selected_collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    neighbor_vectors = selected_collection.get(include=["embeddings"], ids=nearest_neighbors["ids"][0])["embeddings"]
    neighbor_projections = project_embeddings(neighbor_vectors, umap_transform)
   

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=neighbor_projections[:, 0], y=neighbor_projections[:, 1], mode='markers', marker=dict(size=5, color='orange'), name="nearest neighbors"))
    fig.add_trace(go.Scatter(x=query_projection[:, 0], y=query_projection[:, 1], mode='markers', marker=dict(size=10, color='red', symbol='x'), name="query"))

    fig.show()
plot_retrieval_results(
    "Climate Change",
    selected_collection,
)

Lastly we will analyze the distribution of cosine distances between the query and the chunks. This will deepen our understanding of the cosine distance and show that distances in the high-dimensional space are not the same as in the 2D projection. Do not confuse cosine distance with cosine similarity: cosine similarity is the cosine of the angle between two vectors, while cosine distance is 1 minus the cosine similarity, so smaller values mean the vectors are more similar.

def cosine_distance(vector1, vector2):
    dot_product = np.dot(vector1, vector2.T)
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    similarity = dot_product / norm_product
    return 1 - similarity

def plot_cosine_distances(
        query: str,
        selected_collection: Collection
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)

    distances = np.array([cosine_distance(query_embedding, vector) for vector in vectors])

    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=vectors_projections[:, 0],
        y=vectors_projections[:, 1],
        mode='markers',
        marker=dict(
            size=5,
            color=distances.flatten(),
            colorscale='RdBu',
            colorbar=dict(title='Cosine Distance')
        ),
        text=['Cosine Distance: {:.4f}'.format(d) for d in distances.flatten()],
        name='Other Vectors'
    ))

    fig.add_trace(go.Scatter(
        x=[query_projection[0][0]],
        y=[query_projection[0][1]],
        mode='markers',
        marker=dict(size=10, color='black', symbol='x'),
        text=['Query Vector'],
        name='Query Vector'
    ))

    fig.show()
plot_cosine_distances(
    "Climate Change",
    selected_collection,
)
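As a quick sanity check of the cosine_distance helper used above (repeated here so the snippet is self-contained): identical directions give a distance of 0, orthogonal vectors give 1, and opposite vectors give 2.

```python
import numpy as np

def cosine_distance(vector1, vector2):
    dot_product = np.dot(vector1, vector2.T)
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    return 1 - dot_product / norm_product

a = np.array([1.0, 0.0])
print(cosine_distance(a, np.array([2.0, 0.0])))   # 0.0 (same direction)
print(cosine_distance(a, np.array([0.0, 3.0])))   # 1.0 (orthogonal)
print(cosine_distance(a, np.array([-1.0, 0.0])))  # 2.0 (opposite)
```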

Putting it all TogetherΒΆ

Now that we have generated the embeddings and stored them in ChromaDB we can put it all together and create the RAG pipeline. The RAG pipeline consists of the following steps:

1. Embed the user's question and retrieve the most similar chunks from the vector store.
2. Format the retrieved chunks into a single context string.
3. Fill the prompt template with the question and the context.
4. Pass the prompt to the LLM and parse its answer.

How does Langchain work?ΒΆ

In this notebook we will be using Langchain to build up our pipeline. You do not need a library like Langchain or LlamaIndex to build a RAG pipeline, but it can make the process easier.

The idea of Langchain and its LCEL (Langchain Expression Language) is very simple. Within the pipeline there are lots of steps that take an input and produce an output. These steps can be chained together to form a pipeline. The LCEL is a simple language that allows you to define these steps and how they are connected. For more technical details on how Langchain works check out the Langchain Documentation.

In simple terms, Langchain provides an abstraction of a step: each step exposes an invoke method that takes an input (often a dictionary of parameters) and returns an output. This lets you chain different steps together, define how they are connected, and split chains of steps off into separate sub-pipelines.
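The core idea can be sketched in a few lines of plain Python. This is a toy stand-in for Langchain's runnable abstraction, not the real library (the actual Runnable interface is much richer), but it shows how invoke and the | operator compose steps into a chain:

```python
class Step:
    """Toy stand-in for a Langchain runnable: one step with an invoke method."""
    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # step_a | step_b: a new step that feeds a's output into b's invoke
        return Step(lambda value: other.invoke(self.invoke(value)))

# A miniature "retrieve -> prompt" chain built from two steps
retrieve = Step(lambda q: {"question": q, "context": "retrieved chunks go here"})
to_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
mini_chain = retrieve | to_prompt
print(mini_chain.invoke("What is RAG?"))
```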

Below you can see an overview of our RAG pipeline:

rag_pipeline

And now let's look at the implementation of the RAG pipeline.

def create_qa_chain(retriever: BaseRetriever):
    template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. \
    If you don't know the answer, just say that you don't know. Keep the answer concise.

    Question: {question}
    Context: {context}
    Answer:
    """
    rag_prompt = ChatPromptTemplate.from_template(template)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = RunnableParallel(
        {
            "context": retriever,
            "question": RunnablePassthrough()
        }
    ).assign(answer=(
         RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
            | rag_prompt
            | llm
            | StrOutputParser()
    ))

    return rag_chain

For Langchain to work with our ChromaDB collections we need to wrap them in abstractions that Langchain understands: so-called stores and retrievers.

def collection_to_store(collection_name: str, lc_embedding_model: EmbeddingFunction):
    return Chroma(
        client=chroma_client,
        collection_name=collection_name,
        embedding_function=lc_embedding_model,
    )

def store_to_retriever(store: VectorStore, k: int = 3):
    retriever = store.as_retriever(
        search_type="similarity", search_kwargs={'k': k}
    )
    return retriever
selected_store = collection_to_store("gte_recursive_1024", embedding_models["gte"])
selected_retriever = store_to_retriever(selected_store)
selected_retriever.invoke("Climate Change")
[Document(page_content="Climate Change Archives - Page 5 of 63: Southern countries are pushing hard to make transparent the wealth and climate consequences of burning fossil fuels. Bill McKibben says it's clear how impeachably... While I watched the chilled host on the Macy’ s Day Parade television broadcast talk about Tofurky as a vegan Thanksgiving substitute, I can’ t say... A turkey is a symbol of US Thanksgiving dinner traditions. But how do you make flexitarians -- guests who prefer vegetarian or vegan eating... For the first time ever, formal discussions took place at the annual climate convention about food security. The consensus is that, in order to... The new Chris Hemsworth project `` Limitless '' is the perfect antidote to climate doomerism ( with bonus energy storage angle, of course). Food security threatens many regions around the world. Puerto Rico's decades of dependence on outside food imports has impacted the health and resilience of... Engineers working on hydrogen, evtols, UAM, vertiports, hypersonic passenger", metadata={'domain': 'cleantechnica', 'id': 2262, 'title': 'Climate Change Archives - Page 5 of 63', 'url': 'cleantechnica.com/tag/climate-change/page/5'}),
 Document(page_content="scenario used in the study is unlikely because of global efforts to limit greenhouse gas emissions, the findings reveal a previously unknown tipping point that if activated would release an important brake on global warming, the authors said. `` We need to think about these worst-case scenarios to understand how our CO2 emissions might affect the oceans not just this century, but next century and the following century, '' said Megumi Chikamoto, who led the research as a research fellow at the University of Texas Institute for Geophysics. The study was published in the journal Geophysical Research Letters. Today, the oceans soak up about a third of the CO2 emissions generated by humans. Climate simulations had previously shown that the oceans slow their absorption of CO2 over time, but none had considered alkalinity as explanation. To reach their conclusion, the researchers recalculated pieces of a 450-year simulation until they hit on alkalinity as a key cause of the slowing. According to the findings, the", metadata={'domain': 'azocleantech', 'id': 482, 'title': 'Global Warming Could Trigger Chemical Changes in the Ocean Surface that Accelerate Climate Change', 'url': 'azocleantech.com/news.aspx?newsID=33053'}),
 Document(page_content='Potential Climatic Impact of Nord Stream Methane Leaks:  Nord Stream 1 and 2, two subsea pipelines that transport natural gas from Russia to Germany, were both intentionally destroyed on September 26th, 2022. Enormous amounts of gases, mainly methane, were discharged into the ocean and eventually into the atmosphere. Methane escaping from sabotaged pipelines in the Baltic Sea ( September 27th, 2022). Image Credit: Danish Armed Forces Methane is the second most prevalent anthropogenic greenhouse gas after CO2, although its greenhouse effect is substantially stronger. As a result, whether this catastrophe may have detrimental climatic consequences is a major issue around the world. This problem was discussed in a news article published in Nature, but no quantitative implications were reached. Recently, scientists from the Chinese Academy of Sciences’ Institute of Atmospheric Physics approximated the potential climatic effect of leaked methane using the energy-conservation framework of the Intergovernmental', metadata={'domain': 'azocleantech', 'id': 463, 'title': 'Potential Climatic Impact of Nord Stream Methane Leaks', 'url': 'azocleantech.com/news.aspx?newsID=32568'})]

Now that we have our retriever we can create our RAG pipeline. Try some different queries and see how the pipeline responds.

selected_chain = create_qa_chain(selected_retriever)
selected_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='Blue River, Vida, Phoenix, and Talentβ€”were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 5660, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what’ s known as the β€œ vapor pressure deficit, ” or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn’ t the only factor behind the west’ s worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 5661, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year’ s wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 5662, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
 'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
 'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
chains = {}
for collection_name, collection in collections.items():
    store = collection_to_store(collection_name, embedding_models[collection_name.split("_")[0]])
    retriever = store_to_retriever(store)
    chain = create_qa_chain(retriever)
    chains[collection_name] = chain

chains.keys()
dict_keys(['mini_recursive_256', 'mini_recursive_1024', 'mini_semantic', 'bge-m3_recursive_256', 'bge-m3_recursive_1024', 'bge-m3_semantic', 'gte_recursive_256', 'gte_recursive_1024', 'gte_semantic'])

EvaluationΒΆ

Because we have many hyperparameters to tune (chunk size, prompts, etc.) and different strategies to try, we will use the RAGAS (RAG Assessment) framework to evaluate our pipeline. RAGAS lets you evaluate a RAG pipeline using an LLM as a judge, along with other metrics that utilize embedding models. We will go into more detail on the metrics later on.

Before we can start the evaluation we need to define the evaluation questions and their ground-truth answers. For this we will use the provided evaluation questions. To increase our question pool we will also generate additional question-and-answer pairs: we give the LLM (GPT-4o) a random chunk and have it generate a question and answer from it.

human_eval_df.head()
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... sgvoice.energyvoice.com/strategy/technology/23...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... sgvoice.energyvoice.com/policy/25396/eu-seeks-...
3 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... pv-magazine.com/2023/02/02/european-commission...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... sgvoice.energyvoice.com/policy/25396/eu-seeks-...
5 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... cleantechnica.com/2023/05/08/general-motors-se...

As we are only given questions and the relevant sections of the articles we need to generate the answers to the questions. We will use the LLM (GPT-4o) to generate the answers to the questions.

def generate_eval_answers(df: pd.DataFrame) -> pd.DataFrame:
    answer_generation_prompt = """Answer the following question based on the article:
    Question: {question}
    Article: {article}
    """
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm
    for i, row in tqdm(df.iterrows(), total=len(df)):
        df.at[i, "ground_truth"] = answer_generation_chain.invoke({"question": row["question"], "article": row["relevant_section"]}).content
    return df
if (silver_folder / "human_eval.csv").exists():
    human_eval_df = pd.read_csv(silver_folder / "human_eval.csv")
else:
    human_eval_df = generate_eval_answers(human_eval_df)
    human_eval_df.to_csv(silver_folder / "human_eval.csv", index=False)

human_eval_df.head()
question relevant_section url ground_truth
0 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... sgvoice.energyvoice.com/strategy/technology/23... The innovation behind LeclanchΓ©'s new method t...
1 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... sgvoice.energyvoice.com/policy/25396/eu-seeks-... The EU’s Green Deal Industrial Plan is an init...
2 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... pv-magazine.com/2023/02/02/european-commission... The EU’s Green Deal Industrial Plan is aimed a...
3 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... sgvoice.energyvoice.com/policy/25396/eu-seeks-... The four focus areas of the EU's Green Deal In...
4 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... cleantechnica.com/2023/05/08/general-motors-se... The cooperation between GM and Honda on fuel c...

We will now generate the synthetic questions and answers: the LLM receives a random chunk and is asked to produce a question and an answer based on it.

def generate_synthetic_qa_pairs(documents: List[Document], n: int = 10) -> pd.DataFrame:
    synthetic_questions = []
    documents = np.random.choice(documents, n)

    question_generation_prompt = """Generate a short and general question based on the following news article:
    Article: {article}
    """
    question_generation_chain = ChatPromptTemplate.from_template(question_generation_prompt) | llm

    answer_generation_prompt = """Answer the following question based on the article:
    Question: {question}
    Article: {article}
    """
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm


    for document in tqdm(documents):
        element = {}
        content = document.page_content
        element["relevant_section"] = content
        element["url"] = document.metadata["url"]
        question = question_generation_chain.invoke({"article": content}).content
        element["question"] = question
        answer = answer_generation_chain.invoke({"question": question, "article": content}).content
        element["ground_truth"] = answer
        synthetic_questions.append(element)

    return pd.DataFrame(synthetic_questions)
if not (silver_folder / "synthetic_eval.csv").exists():
    synthetic_eval_df = generate_synthetic_qa_pairs(chunks["recursive_1024"], 25)
    synthetic_eval_df.to_csv(silver_folder / "synthetic_eval.csv", index=False)
else:
    synthetic_eval_df = pd.read_csv(silver_folder / "synthetic_eval.csv", index_col=0)
synthetic_eval_df.head()
url question ground_truth
relevant_section
for hybrid and fully battery electric vehicles, aiming to bring the industry closer to achieving the key tipping points for mainstream electric vehicle ( EV) adoption. Castrol has been working on advanced EV fluids designed to manage temperatures within Li-ON cells to enable ultra-fast charging and better efficiency. Meanwhile, in its so-called Big Battery Challenge, the UK’ s Institute of Mechanical Engineering ( IMechE) experts have determined that, while it is likely the Li-ON battery will dominate for the time being, β€œ there are plenty of potential long-term challengers ”. Three contenders are especially identified: sodium ion ( Na-ON), solid state and Lithium-sulphur ( Li-S). Sodium-ion batteries are regarded as an emerging technology with β€œ promising cost, safety, sustainability and performance advantages ” over commercialised lithium-ion batteries. According to IMechE material: β€œ Key advantages include the use of widely available and inexpensive raw materials and a rapidly scaleable technology based energyvoice.com/technology/446761/batteries-te... What are the potential long-term challengers t... The potential long-term challengers to lithium...
and content of the checkpoint. Using the feedback, the checkpoint will then be established as a new measure to assess potential future licences. It will ensure any future licences are granted on the basis that they are compatible with the UK’ s goal to become net zero by 2050. If the evidence suggests that a future licensing round would undermine progress towards that target, it would not go ahead, UK Government said. The new checkpoint will add an additional layer of scrutiny to future licences, on top of the existing measures that already apply to UK oil and gas developments. Operators currently have to adhere to regulations enforced by the Offshore Petroleum Regulator for Environment and Decommission ( OPRED), as well as the net zero impact assessment carried out by the OGA as part of its consent process for new licences. Malcolm Offord, UK Government minister for Scotland, said: β€œ The UK Government fully supports the oil and gas industry in its transition away from fossil fuels to cleaner, greener energy energyvoice.com/oilandgas/north-sea/374073/uk-... How will the new checkpoint affect the process... The new checkpoint will affect the process of ...
France ( 103 days) and the Netherlands ( 123 days). Centrica said it had completed β€œ significant engineering upgrades ” over the summer and in August was given the go-ahead by the offshore regulator North Sea Transition Authority ( NSTA) to reopen the site. This was followed by commissioning over the early autumn, enabling it to make its first injection of gas into the site in over five years. The work done so far means that Rough is operating at around 20% of its previous capacity this winter, immediately making it the UK’ s largest gas storage site once again and adding 50% to the UK’ s gas storage volume. The operator now says its long-term aim is to turn the Rough gas field into β€œ the largest long duration energy storage facility in Europe ”, capable of storing both natural gas and hydrogen – a major turnaround in fortunes for the previously mothballed site. Centrica group chief executive Chris O’ Shea said: β€œ I’ m delighted that we have managed to return Rough to storage operations for this winter energyvoice.com/oilandgas/north-sea/455701/cen... What are the implications of reopening the Rou... Reopening the Rough gas field has significant ...
Going underground: how solar sites can boost biodiversity: The UK’ s biodiversity crisis stands in the shadow of our energy price crisis – but both challenges can be addressed through renewable energy. Mark Rowcroft, Development Director at solar and battery storage developer Exagen, explains how reaching our full solar energy potential means looking not only to the skies, but to the soil. Boosting UK renewable energy is a key route to tackling the energy price crisis, with solar power the cheapest form of electricity today. The Prime Minister’ s COP27 speech reaffirmed his commitment to clean energy and, with the UK targeting 70GW of solar generation by 2035, huge potential exists to grow solar generation. Yet misconceptions stubbornly remain: such as the argument frequently presented by opponents of solar farms that they β€œ industrialise the land ”, without realising the extent to which UK farmland is already industrialised. A common misconception is that UK farmland is bursting with wildlife and sgvoice.energyvoice.com/policy/18796/going-und... How can solar energy sites contribute to addre... Solar energy sites can contribute to addressin...
solar manufacturer Kaneka as a supplier for solar cell deployment in one of its electric vehicles. Kaneka's solar cels have been for years recognized as the most efficient crystalline silicon PV device developed at both the industry and research levels. However, Chinese manufacturer Longi said last November that it had crossed reached a power conversion efficiency of 26.81% with an unspecified heterojunction ( HJT) solar cell, based on a full-size silicon wafer, in mass production. This content is protected by copyright and may not be reused. If you want to cooperate with us and would like to reuse some of our content, please contact: editors @ pv-magazine.com. Please be mindful of our community standards. Your email address will not be published. Required fields are marked * Save my name, email, and website in this browser for the next time I comment. By submitting this form you agree to pv magazine using your data for the purposes of publishing your comment. Your personal data will only be disclosed or pv-magazine.com/2023/07/03/enecoat-toyota-deve... What advancements have been made in solar cell... Companies like Kaneka and Longi have made sign...
question_length = {
    "human": human_eval_df["question"].map(len),
    "synthetic": synthetic_eval_df["question"].map(len)
}

sns.histplot(question_length, kde=True)
plt.title("Question Length Distribution")
plt.xlabel("Question Length")
plt.ylabel("Count")
plt.show()
eval_df = pd.concat([human_eval_df, synthetic_eval_df], ignore_index=True)
eval_df["is_synthetic"] = eval_df["relevant_section"].isna()
eval_df["is_synthetic"].value_counts()
is_synthetic
True     25
False    23
Name: count, dtype: int64

Now we have roughly doubled the number of questions and answers. However, the synthetic questions are slightly longer than the provided ones, which could mean they are slightly easier to answer. This potential bias should be kept in mind when evaluating the pipeline.

RAGAS MetricsΒΆ

RAGAS provides a variety of metrics to evaluate the performance of a RAG pipeline. Here are some of the key metrics we will be using and how they are calculated:

ragas-metrics

For this to work we create a test dataset for each of our RAG pipelines that contains the evaluation questions and their ground truth answers. We then run all the questions through our RAG pipeline and store the generated answers and the retrieved chunks. We can then use this test dataset to calculate the RAGAS metrics.
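To give an intuition for how an embedding-based RAGAS metric works under the hood, here is a minimal sketch of answer relevancy: RAGAS generates questions from the produced answer and averages their embedding similarity to the original question. The function names below are illustrative, not the actual RAGAS API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy_sketch(question_emb, generated_question_embs) -> float:
    """Mean cosine similarity between the original question embedding and
    the embeddings of questions generated from the answer."""
    return float(np.mean([cosine_similarity(question_emb, g)
                          for g in generated_question_embs]))
```

An answer that talks past the question yields generated questions whose embeddings sit far from the original question, pulling the score towards 0.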

datasets_folder = gold_folder / "datasets"
if not datasets_folder.exists():
    datasets_folder.mkdir()

def get_or_create_eval_dataset(name: str, df: pd.DataFrame, chain: Chain) -> Dataset:
    dataset_file = datasets_folder/ f"{name}_dataset.json"
    if dataset_file.exists():
        with open(dataset_file, "r") as file:
            dataset = Dataset.from_dict(json.load(file))
        print(f"Loaded {name} dataset from {dataset_file}")
    else:
        datapoints = {
            "question": df["question"].tolist(),
            "answer": [],
            "contexts": [],
            "ground_truth": df["ground_truth"].tolist(),
            "context_urls": []
        }
        for question in tqdm(datapoints["question"]):
            result = chain.invoke(question)
            datapoints["answer"].append(result["answer"])
            datapoints["contexts"].append([str(doc.page_content) for doc in result["context"]])
            datapoints["context_urls"].append([doc.metadata["url"] for doc in result["context"]])
        dataset = Dataset.from_dict(datapoints)
        with open(dataset_file, "w") as file:
            json.dump(dataset.to_dict(), file)
        print(f"Saved {name} dataset to {dataset_file}")
    return dataset
results_folder = gold_folder / "results"
if not results_folder.exists():
    results_folder.mkdir()

def get_or_run_llm_eval(name: str, dataset: Dataset, llm_judge_model: LLM) -> pd.DataFrame:
    eval_results_file = results_folder / f"{name}_llm_eval_results.csv"
    if eval_results_file.exists():
        eval_results = pd.read_csv(eval_results_file)
        print(f"Loaded {name} evaluation results from {eval_results_file}")
    else:
        eval_results = evaluate(dataset,
                                metrics=[faithfulness, answer_relevancy, context_relevancy, answer_correctness],
                                is_async=True,
                                llm=llm_judge_model,
                                embeddings=embedding_models["gte"],
                                run_config=RunConfig(
                                    timeout=60, max_retries=10, max_wait=60, max_workers=8),
                                ).to_pandas()
        eval_results.to_csv(eval_results_file, index=False)
        print(f"Saved {name} evaluation results to {eval_results_file}")
    return eval_results
def plot_llm_eval(name: str, eval_results: pd.DataFrame):
    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = eval_results.select_dtypes(include=[np.float64])

    # boxplot of distributions
    sns.boxplot(data=ragas_metrics_data, palette="Set2")
    plt.title(f'{name}: Distribution of RAGAS Evaluation Metrics')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

    # barplot of means
    means = ragas_metrics_data.mean()
    plt.figure(figsize=(14, 8))
    sns.barplot(x=means.index, y=means, palette="Set2")
    plt.title(f'{name}: Mean of RAGAS Evaluation Metrics')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)

    plt.tight_layout()
    plt.show()
def plot_multiple_evals(eval_results: Dict[str, pd.DataFrame]):
    # combine the results
    full_results = []
    for name, results in eval_results.items():
        results['name'] = name
        full_results.append(results)

    full_results = pd.concat(full_results, ignore_index=True)
    full_results = full_results.sort_values(by='name')


    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = full_results.select_dtypes(include=[np.float64])
    ragas_metrics_data['name'] = full_results['name']
    
    # boxplot of distributions
    plt.figure(figsize=(14, 8))
    sns.boxplot(x='variable', y='value', hue='name', data=pd.melt(ragas_metrics_data, id_vars='name'), palette="Set2")
    plt.title('Distribution of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
    
    # barplot of means
    means = ragas_metrics_data.groupby('name').mean().reset_index()
    means_melted = pd.melt(means, id_vars='name')
    
    plt.figure(figsize=(14, 8))
    sns.barplot(x='variable', y='value', hue='name', data=means_melted, palette="Set2")
    plt.title('Mean of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
selected_dataset = get_or_create_eval_dataset("selected", eval_df, selected_chain)
Loaded selected dataset from data\gold\datasets\selected_dataset.json

As a judge we use the GPT-4o-mini model, a smaller version of GPT-4o. While it is not as powerful as the full GPT-4o model, it is still very capable and can evaluate the performance of our RAG pipeline without incurring high costs.

It has also been suggested in the literature that when evaluating LLMs with LLMs as judges, the evaluation is more reliable when the judge is a different model than the one being evaluated, since models may exploit each other's weaknesses or be biased towards their own answers. https://arxiv.org/abs/2404.13076

judge = ChatOpenAI(model="gpt-4o-mini")
question_prompt = ChatPromptTemplate.from_template(
    "Answer the following question: {question}")
question_chain = question_prompt | judge | StrOutputParser()
question_chain.invoke({"question": "What is the meaning of life?"})
'The meaning of life is a philosophical question that has been contemplated by humans for centuries. Different cultures, religions, and individuals have offered various interpretations. Some people find meaning through relationships, love, and connection with others, while others seek purpose through personal achievements, spirituality, or contributing to the greater good.\n\nIn existential philosophy, the meaning of life is often seen as something that each person must define for themselves. This perspective emphasizes personal responsibility and the idea that individuals create their own meaning through their choices and actions. \n\nUltimately, the meaning of life can be deeply personal and subjective, varying greatly from one person to another. It may encompass a combination of experiences, beliefs, values, and aspirations that resonate with an individual’s understanding of their existence.'
selected_llm_eval_results = get_or_run_llm_eval("selected", selected_dataset, judge)
plot_llm_eval("selected", selected_llm_eval_results)
Loaded selected evaluation results from data\gold\results\selected_llm_eval_results.csv
datasets = {}
for name, chain in chains.items():
    datasets[name] = get_or_create_eval_dataset(name, eval_df, chain)
Loaded mini_recursive_256 dataset from data\gold\datasets\mini_recursive_256_dataset.json
Loaded mini_recursive_1024 dataset from data\gold\datasets\mini_recursive_1024_dataset.json
Loaded mini_semantic dataset from data\gold\datasets\mini_semantic_dataset.json
Loaded bge-m3_recursive_256 dataset from data\gold\datasets\bge-m3_recursive_256_dataset.json
Loaded bge-m3_recursive_1024 dataset from data\gold\datasets\bge-m3_recursive_1024_dataset.json
Loaded bge-m3_semantic dataset from data\gold\datasets\bge-m3_semantic_dataset.json
Loaded gte_recursive_256 dataset from data\gold\datasets\gte_recursive_256_dataset.json
Loaded gte_recursive_1024 dataset from data\gold\datasets\gte_recursive_1024_dataset.json
Loaded gte_semantic dataset from data\gold\datasets\gte_semantic_dataset.json
llm_results = {}
for dataset_name, dataset in datasets.items():
    llm_results[dataset_name] = get_or_run_llm_eval(dataset_name, dataset, judge)
Loaded mini_recursive_256 evaluation results from data\gold\results\mini_recursive_256_llm_eval_results.csv
Loaded mini_recursive_1024 evaluation results from data\gold\results\mini_recursive_1024_llm_eval_results.csv
Loaded mini_semantic evaluation results from data\gold\results\mini_semantic_llm_eval_results.csv
Loaded bge-m3_recursive_256 evaluation results from data\gold\results\bge-m3_recursive_256_llm_eval_results.csv
Loaded bge-m3_recursive_1024 evaluation results from data\gold\results\bge-m3_recursive_1024_llm_eval_results.csv
Loaded bge-m3_semantic evaluation results from data\gold\results\bge-m3_semantic_llm_eval_results.csv
Loaded gte_recursive_256 evaluation results from data\gold\results\gte_recursive_256_llm_eval_results.csv
Loaded gte_recursive_1024 evaluation results from data\gold\results\gte_recursive_1024_llm_eval_results.csv
Loaded gte_semantic evaluation results from data\gold\results\gte_semantic_llm_eval_results.csv
plot_multiple_evals(llm_results)
mean_scores = {}
for name, results in llm_results.items():
    mean_scores[name] = results.select_dtypes(include=[np.float64]).mean()

total_mean_scores = pd.DataFrame(mean_scores).mean()
total_mean_scores.sort_values(ascending=False)
bge-m3_recursive_1024    0.648580
gte_recursive_1024       0.647773
bge-m3_semantic          0.626282
gte_recursive_256        0.624573
mini_recursive_1024      0.622090
gte_semantic             0.605386
mini_semantic            0.601130
bge-m3_recursive_256     0.588193
mini_recursive_256       0.558627
dtype: float64

From the evaluation we can see that the RAG pipelines using the GTE embedding model by Alibaba or the BGE-M3 model, combined with recursive chunking at a chunk size of 1024, perform best on average across the metrics. This is likely because these are the most powerful embedding models, and recursive chunking with a chunk size of 1024 provides the LLM with enough context without distracting it with too much.
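Averaging across metrics can hide trade-offs, so a complementary view is to ask which pipeline wins each individual metric. A small sketch reusing the `mean_scores` dict built above (`best_pipeline_per_metric` is an illustrative helper, not part of RAGAS):

```python
import pandas as pd

def best_pipeline_per_metric(mean_scores: dict) -> pd.Series:
    """For each metric (row), return the name of the pipeline (column)
    with the highest mean score."""
    scores = pd.DataFrame(mean_scores)  # rows: metrics, columns: pipelines
    return scores.idxmax(axis=1)
```

If one pipeline dominates most rows, the overall-mean ranking is trustworthy; if the wins are split, the choice depends on which metric matters most for the application.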

best_collection = collections["gte_recursive_1024"]
best_store = collection_to_store("gte_recursive_1024", embedding_models["gte"])

Advanced MethodsΒΆ

In this final section we will look at some more advanced methods to improve our RAG pipeline and comparing them to our best performing pipeline.

Multi-QueryingΒΆ

Multi-querying is a technique that queries the retrieval model with several questions to retrieve relevant chunks. This approach can enhance retrieval by leveraging the diversity of queries to capture a broader range of relevant information. By combining the results from multiple queries, we can potentially improve the quality of the retrieved chunks and, consequently, the generated responses. When creating these additional queries, the goal is to produce queries that differ from the original but remain relevant to the user's information need, i.e. variations of the original query.

multi-querying

def generate_query_variations(query: str, num_additional_queries: int) -> List[str]:
    multiquery_prompt = """You are an assistant tasked with generating {num_queries} \
    different versions of the given user question to retrieve relevant documents from a vector \
    database. By generating multiple perspectives on the user question and breaking it down, your goal is to help \
    the user overcome some of the limitations of the distance-based similarity search. \
    Provide these alternative questions separated by newlines without any numbering or listing.
    Original question: {question}
    Alternatives:
    """

    multiquery_chain = ChatPromptTemplate.from_template(multiquery_prompt) | llm
    return multiquery_chain.invoke({"question": query, "num_queries": num_additional_queries}).content.split("\n")
def plot_multiquery_retrieval_results(query: str, collection : Collection, num_additional_queries: int = 3, num_results: int = 3):
    vectors = get_vectors_from_collection(collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)

    query_variations = generate_query_variations(query, num_additional_queries)
    query_variations_projections = project_embeddings(collection._embedding_function(query_variations), umap_transform)

    original_relevant_docs = collection.query(
        query_texts=[query],
        n_results=num_results,
    )
    original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist] # flatten
    original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
    original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)
    
    additional_relevant_docs = collection.query(
        query_texts=query_variations,
        n_results=num_results,
    )
    additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist] # flatten 
    # remove duplicates
    additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
    # remove the original relevant docs from the additional relevant docs
    additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
    additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
    additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
    fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="query variations"))
    fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
    fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
    
    fig.show()
plot_multiquery_retrieval_results("Climate Change", selected_collection)
class MultiQueryRetriever(BaseRetriever):
    store: VectorStore
    num_additional_queries: int = 3
    num_results: int = 3

    def _get_query_variations(self, query: str) -> List[str]:
       return generate_query_variations(query, self.num_additional_queries)

    def _get_relevant_documents(
        self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        queries = self._get_query_variations(original_query)
        queries.append(original_query)
        retriever = store_to_retriever(self.store, k=self.num_results)
        relevant_docs = []
        for query in queries:
            results = retriever.invoke(query, config={"callbacks": run_manager.get_child()})
            # remove duplicates
            for res in results:
                if res not in relevant_docs:
                    relevant_docs.append(res)
        return relevant_docs
multiquery_retriever = MultiQueryRetriever(store=best_store, num_additional_queries=3, num_results=3)
multiquery_chain = create_qa_chain(multiquery_retriever)
multiquery_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what’ s known as the β€œ vapor pressure deficit, ” or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn’ t the only factor behind the west’ s worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 5661, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='Blue River, Vida, Phoenix, and Talentβ€”were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 5660, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='parts of Washington, Oregon, Idaho, and Nevada. Some scientists are also raising concerns that all the young grasses and other plants that have sprung up as a result of the wet weather could quickly turn into dry kindling for wildfires as the dry season wears on into late summer and fall. According to the latest wildland fire outlook, most of the western United States is expected to experience either normal or below-normal fire activity between May and August this year. Source: National Interagency Fire Center. There are many different ways to measure wildfire activity, but by almost any metric, wildfires across the western US and southwestern Canada are worsening. Reliable, consistent wildfire metrics across the region started to become available in the mid-1980s. Here’ s what the trends show. From 1984 to 1999, the region experienced an average of roughly 230 fires per year. From 2000 to 2021, the average was more than 350 fires per year. The number of wildfires larger than 1,000 acres in western North', metadata={'domain': 'cleantechnica', 'id': 5655, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year’ s wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 5662, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
 'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
 'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
datasets["multiquery"] = get_or_create_eval_dataset("multiquery", eval_df, multiquery_chain)
Loaded multiquery dataset from data\gold\datasets\multiquery_dataset.json
llm_results["multiquery"] = get_or_run_llm_eval("multiquery", datasets["multiquery"], judge)
Loaded multiquery evaluation results from data\gold\results\multiquery_llm_eval_results.csv
strategy_results = {}
strategy_results["gte_recursive_1024"] = llm_results["gte_recursive_1024"]
strategy_results["multiquery"] = llm_results["multiquery"]
plot_multiple_evals(strategy_results)

We can see that on average the answer correctness slightly increases when using multi-querying. This is likely because the retrieval process is more robust and captures a broader range of relevant information. However, faithfulness and context relevancy decrease, which could be because multi-querying introduces more noise: it retrieves more chunks overall, some of which are less relevant.
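One way to reduce that noise, instead of simply deduplicating and concatenating the results of all query variations, would be to fuse them with reciprocal rank fusion, which favours chunks retrieved by several variations. A minimal sketch, not used in the pipeline above:

```python
from collections import defaultdict
from typing import List

def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents retrieved by many query variations rise to the top."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The top-n fused IDs could then be passed to the LLM, keeping the breadth of multi-querying while down-weighting chunks retrieved by only a single variation.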

HyDE - Hypothetical Document EmbeddingsΒΆ

The idea of the HyDE method is to generate hypothetical documents that are similar to the user query and then retrieve the chunks most similar to these hypothetical documents. This can be useful when the user query is not very specific or not very similar to the chunks; the hypothetical documents are closer to the chunks and therefore improve retrieval. Another way to think about it is that we generate a hypothetical answer, thereby reaching an area of the embedding space that is closer to the actual answer and might not be reachable from the user query alone.

hyde

def generate_hypothetical_document(query: str, num_hypotheses: int) -> List[str]:
    hyde_prompt = """Please write a news passage about the topic.
    Topic: {query}
    Passage:
    """

    hyde_chain = ChatPromptTemplate.from_template(hyde_prompt) | llm
    hypothetical_documents = [hyde_chain.invoke({"query": query}).content for _ in range(num_hypotheses)]
    return hypothetical_documents
def plot_hyde_retrieval_results(query: str, collection : Collection, num_hypo_documents: int = 2, num_results: int = 3):
    vectors = get_vectors_from_collection(collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)

    hypothetical_documents = generate_hypothetical_document(query, num_hypo_documents)
    query_variations_projections = project_embeddings(collection._embedding_function(hypothetical_documents), umap_transform)

    original_relevant_docs = collection.query(
        query_texts=[query],
        n_results=num_results,
    )
    original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist] # flatten
    original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
    original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)
    
    additional_relevant_docs = collection.query(
        query_texts=hypothetical_documents,
        n_results=num_results,
    )
    additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist] # flatten 
    # remove duplicates
    additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
    # remove the original relevant docs from the additional relevant docs
    additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
    additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
    additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
    fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="hypothetical documents"))
    fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
    fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
    
    fig.show()
plot_hyde_retrieval_results("Climate Change", selected_collection)
class HyDERetriever(BaseRetriever):
    store: VectorStore
    num_hypo_documents: int = 2
    num_results: int = 3

    def _get_hypothetical_documents(self, query: str) -> List[str]:
        return generate_hypothetical_document(query, self.num_hypo_documents)

    def _get_relevant_documents(
        self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        hypothetical_documents = self._get_hypothetical_documents(original_query)
        hypothetical_documents.append(original_query)
        retriever = store_to_retriever(self.store, k=self.num_results)
        relevant_docs = []
        for query in hypothetical_documents:
            results = retriever.invoke(query, config={"callbacks": run_manager.get_child()})
            # remove duplicates
            for res in results:
                if res not in relevant_docs:
                    relevant_docs.append(res)
        return relevant_docs
hyde_retriever = HyDERetriever(store=best_store, num_hypo_documents=2, num_results=3)
hyde_chain = create_qa_chain(hyde_retriever)
hyde_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='Blue River, Vida, Phoenix, and Talentβ€”were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 5660, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what’ s known as the β€œ vapor pressure deficit, ” or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn’ t the only factor behind the west’ s worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 5661, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='Let’ s dive into western wildfires by the numbers. As spring turns to summer and the days warm up, the Northern Hemisphere enters the period known as Danger Season, when wildfires, heat waves, and hurricanes, all amplified by climate change, begin to ramp up. In the western United States, the start of Danger Season is marked by the shift from the wintertime wet season to the summertime dry season. While wildfires can and do occur all year round, this shift from cool and wet to warm and dry marks the start of wildfire season in the region. According to the latest seasonal outlook from the National Interagency Fire Center, the exceptionally rainy and snowy conditions the west experienced during the winter of 2022-2023 are translating to below-average to normal levels of wildfire risk across most western states at least through August. That said, above-normal activity is expected for parts of Washington, Oregon, Idaho, and Nevada. Some scientists are also raising concerns that all the young grasses and other', metadata={'domain': 'cleantechnica', 'id': 5654, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year’ s wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 5662, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
 'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
 'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
datasets["hyde"] = get_or_create_eval_dataset("hyde", eval_df, hyde_chain)
Loaded hyde dataset from data\gold\datasets\hyde_dataset.json
llm_results["hyde"] = get_or_run_llm_eval("hyde", datasets["hyde"], judge)
Loaded hyde evaluation results from data\gold\results\hyde_llm_eval_results.csv
strategy_results["hyde"] = llm_results["hyde"]
plot_multiple_evals(strategy_results)

Just like with multi-querying, we can see that the answer correctness increases when using the HyDE method.

Other MethodsΒΆ

There are many other methods that can be used to improve the RAG pipeline. Some of these include:

os.system("jupyter nbconvert --to html --template pj cleantech_rag.ipynb")
0